Linux-kselftest Archive on lore.kernel.org
 help / color / Atom feed
From: Yonghong Song <yhs@fb.com>
To: Alan Maguire <alan.maguire@oracle.com>
Cc: "ast@kernel.org" <ast@kernel.org>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	Martin Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"jakub.kicinski@netronome.com" <jakub.kicinski@netronome.com>,
	"hawk@kernel.org" <hawk@kernel.org>,
	"john.fastabend@gmail.com" <john.fastabend@gmail.com>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"quentin.monnet@netronome.com" <quentin.monnet@netronome.com>,
	Andrey Ignatov <rdna@fb.com>, "joe@wand.net.nz" <joe@wand.net.nz>,
	"acme@redhat.com" <acme@redhat.com>,
	"jolsa@kernel.org" <jolsa@kernel.org>,
	"alexey.budankov@linux.intel.com"
	<alexey.budankov@linux.intel.com>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	"namhyung@kernel.org" <namhyung@kernel.org>,
	"sdf@google.com" <sdf@google.com>,
	"f.fainelli@gmail.com" <f.fainelli@gmail.com>,
	"shuah@kernel.org" <shuah@kernel.org>,
	"peter@lekensteyn.nl" <peter@lekensteyn.nl>,
	"ivan@cloudflare.com" <ivan@cloudflare.com>,
	Andrii Nakryiko <andriin@fb.com>,
	"bhole_prashant_q7@lab.ntt.co.jp"
	<bhole_prashant_q7@lab.ntt.co.jp>,
	"david.calavera@gmail.com" <david.calavera@gmail.com>,
	"danieltimlee@gmail.com" <danieltimlee@gmail.com>,
	Takshak Chahande <ctakshak@fb.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	"toke@redhat.com" <toke@redhat.com>,
	"jbenc@redhat.com" <jbenc@redhat.com>
Subject: Re: [RFC bpf-next 2/7] bpf: extend bpf_pcap support to tracing programs
Date: Tue, 10 Sep 2019 07:43:04 +0000
Message-ID: <cc514705-9974-9328-b66e-b57cfa61d417@fb.com> (raw)
In-Reply-To: <alpine.LRH.2.20.1909092236490.10757@dhcp-10-175-172-139.vpn.oracle.com>



On 9/9/19 11:25 PM, Alan Maguire wrote:
> On Sun, 8 Sep 2019, Yonghong Song wrote:
>   
>> For net side bpf_perf_event_output, we have
>> static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
>>                                     unsigned long off, unsigned long len)
>> {
>>           void *ptr = skb_header_pointer(skb, off, len, dst_buff);
>>
>>           if (unlikely(!ptr))
>>                   return len;
>>           if (ptr != dst_buff)
>>                   memcpy(dst_buff, ptr, len);
>>
>>           return 0;
>> }
>>
>> BPF_CALL_5(bpf_skb_event_output, struct sk_buff *, skb, struct bpf_map
>> *, map,
>>              u64, flags, void *, meta, u64, meta_size)
>> {
>>           u64 skb_size = (flags & BPF_F_CTXLEN_MASK) >> 32;
>>
>>           if (unlikely(flags & ~(BPF_F_CTXLEN_MASK | BPF_F_INDEX_MASK)))
>>                   return -EINVAL;
>>           if (unlikely(skb_size > skb->len))
>>                   return -EFAULT;
>>
>>           return bpf_event_output(map, flags, meta, meta_size, skb, skb_size,
>>                                   bpf_skb_copy);
>> }
>>
>> It does not really consider output all the frags.
>> I understand that to get truly all packet data, frags should be
>> considered, but seems we did not do it before? I am wondering
>> whether we need to do here.
> 
> Thanks for the feedback! In experimenting with packet capture,
> my original hope was to keep things simple and avoid fragment parsing
> if possible. However if scatter-gather is enabled for the networking
> device, or indeed if it's running in a VM it turns out a lot of the
> interesting packet data ends up in the fragments on transmit (ssh
> headers, http headers etc).  So I think it would be worth considering
> adding support for fragment traversal.  It's not needed as much
> in the skb program case - we can always pullup the skb - but in
> the tracing situation we probably wouldn't want to do something
> that invasive in tracing context.

Agree that in tracing context, we should avoid push/pull skb. It is
indeed invasive.

> 
> Fragment traversal might be worth breaking out as a separate patchset,
> perhaps triggered by a specific flag to bpf_skb_event_output?

This can be done for bpf_skb_event_output as the context is a sk_buff.
And you can just follow the frags to copy the whole thing without
bpf_probe_read().

> 
> Feedback from folks at Linux Plumbers (I hope I'm summarizing correctly)
> seemed to agree with what you mentioned WRT the first patch in this
> series.  The gist was we probably don't want to force the metadata to be a
> specific packet capture type; we'd rather use the existing perf event
> mechanisms and if we are indeed doing packet capture, simply specify that
> data in the program as metadata.

Agree, you can have whatever metadata you have for bpf_perf_event_output.

> 
> I'd be happy with that approach myself if I could capture skb
> fragments in tracing programs - being able to do that would give
> equivalent functionality to what I proposed but without having a packet
> capture-specific helper.

That won't work for tracing program. Full of bpf_probe_read()
in tracing version of packet copying is not nice either.

We may still need a different helper for tracing programs.

I think we need something like below:
   - vmlinux BTF at /sys/kernel/btf/kernel, is loaded into kernel.
     (/sys/kernel/btf/kernel is the source of truth)
   - For a tracing bpf program, if that function eventually
     copy  helper
         bpf_skb_event_output(..., skb, ...)
     the verifier needs to verify skb is indeed a valid skb
     by tracing back to one of parameters.

     Here, I use skb as an example, maybe it can be extended
     to other data structures as well.

With this approach, you can reuse some of functions from
tracing side to deal with frag copying and no bpf_probe_read()
is needed.

Here, I use skb as an example, maybe it can be extended
to other data structures as well if needed.

>>
>> If we indeed do not need to handle frags here, I think maybe
>> bpf_probe_read() in existing bpf kprobe function should be
>> enough, we do not need this helper?
>>
> 
> Certainly for many use cases, that will get you most of what you need -
> particularly if you're just looking at L2 to L4 data. For full packet
> capture however I think we may need to think about fragment traversal.
> 
>>> +
>>> +/* Derive protocol for some of the easier cases.  For tracing, a probe point
>>> + * may be dealing with packets in various states. Common cases are IP
>>> + * packets prior to adding MAC header (_PCAP_TYPE_IP) and a full packet
>>> + * (_PCAP_TYPE_ETH).  For other cases the caller must specify the
>>> + * protocol they expect.  Other heuristics for packet identification
>>> + * should be added here as needed, since determining the packet type
>>> + * ensures we do not capture packets that fail to match the desired
>>> + * pcap type in BPF_F_PCAP_STRICT_TYPE mode.
>>> + */
>>> +static inline int bpf_skb_protocol_get(struct sk_buff *skb)
>>> +{
>>> +	switch (htons(skb->protocol)) {
>>> +	case ETH_P_IP:
>>> +	case ETH_P_IPV6:
>>> +		if (skb_network_header(skb) == skb->data)
>>> +			return BPF_PCAP_TYPE_IP;
>>> +		else
>>> +			return BPF_PCAP_TYPE_ETH;
>>> +	default:
>>> +		return BPF_PCAP_TYPE_UNSET;
>>> +	}
>>> +}
>>> +
>>> +BPF_CALL_5(bpf_trace_pcap, void *, data, u32, size, struct bpf_map *, map,
>>> +	   int, protocol_wanted, u64, flags)
>>
>> Up to now, for helpers, verifier has a way to verifier it is used
>> properly regarding to the context. For example, for xdp version
>> perf_event_output, the help prototype,
>>     BPF_CALL_5(bpf_xdp_event_output, struct xdp_buff *, xdp, struct
>> bpf_map *, map,
>>              u64, flags, void *, meta, u64, meta_size)
>> the verifier is able to guarantee that the first parameter
>> has correct type xdp_buff, not something from type cast.
>>     .arg1_type      = ARG_PTR_TO_CTX,
>>
>> This helper, in the below we have
>>     .arg1_type	= ARG_ANYTHING,
>>
>> So it is not really enforced. Bringing BTF can help, but type
>> name matching typically bad.
>>
>>
> One thing we were discussing - and I think this is similar to what
> you're suggesting - is to investigate if there might be a way to
> leverage BTF to provide additional guarantees that the tracing
> data we are handling is indeed an skb.  Specifically if we
> trace a kprobe function argument or a tracepoint function, and
> if we had that guarantee, we could perhaps invoke the skb-style
> perf event output function (trace both the skb data and the metadata).
> The challenge would be how to do that type-based matching; we'd
> need the function argument information from BTF _and_ need to
> somehow associate it at probe attach time.
> 
> Thanks again for looking at the code!
> 
> Alan
> 

  reply index

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-07 21:40 [RFC bpf-next 0/7] bpf: packet capture helpers, bpftool support Alan Maguire
2019-09-07 21:40 ` [RFC bpf-next 1/7] bpf: add bpf_pcap() helper to simplify packet capture Alan Maguire
2019-09-08 22:02   ` Yonghong Song
2019-09-07 21:40 ` [RFC bpf-next 2/7] bpf: extend bpf_pcap support to tracing programs Alan Maguire
2019-09-08 22:18   ` Yonghong Song
2019-09-09 22:25     ` Alan Maguire
2019-09-10  7:43       ` Yonghong Song [this message]
2019-09-07 21:40 ` [RFC bpf-next 3/7] bpf: sync tools/include/uapi/linux/bpf.h for pcap support Alan Maguire
2019-09-07 21:40 ` [RFC bpf-next 4/7] bpf: add libpcap feature test Alan Maguire
2019-09-07 21:40 ` [RFC bpf-next 5/7] bpf: add pcap support to bpftool Alan Maguire
2019-09-07 21:40 ` [RFC bpf-next 6/7] bpf: add documentation for bpftool pcap subcommand Alan Maguire
2019-09-07 21:40 ` [RFC bpf-next 7/7] bpf: add tests for bpftool packet capture Alan Maguire

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cc514705-9974-9328-b66e-b57cfa61d417@fb.com \
    --to=yhs@fb.com \
    --cc=acme@redhat.com \
    --cc=alan.maguire@oracle.com \
    --cc=alexey.budankov@linux.intel.com \
    --cc=andriin@fb.com \
    --cc=ast@kernel.org \
    --cc=bhole_prashant_q7@lab.ntt.co.jp \
    --cc=bpf@vger.kernel.org \
    --cc=ctakshak@fb.com \
    --cc=daniel@iogearbox.net \
    --cc=danieltimlee@gmail.com \
    --cc=davem@davemloft.net \
    --cc=david.calavera@gmail.com \
    --cc=f.fainelli@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hawk@kernel.org \
    --cc=ivan@cloudflare.com \
    --cc=jakub.kicinski@netronome.com \
    --cc=jbenc@redhat.com \
    --cc=joe@wand.net.nz \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kafai@fb.com \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=peter@lekensteyn.nl \
    --cc=quentin.monnet@netronome.com \
    --cc=rdna@fb.com \
    --cc=rostedt@goodmis.org \
    --cc=sdf@google.com \
    --cc=shuah@kernel.org \
    --cc=songliubraving@fb.com \
    --cc=toke@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-kselftest Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-kselftest/0 linux-kselftest/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-kselftest linux-kselftest/ https://lore.kernel.org/linux-kselftest \
		linux-kselftest@vger.kernel.org linux-kselftest@archiver.kernel.org
	public-inbox-index linux-kselftest

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kselftest


AGPL code for this site: git clone https://public-inbox.org/ public-inbox