* [PATCH net-next] doc: document MSG_ZEROCOPY
@ 2017-08-31 21:00 Willem de Bruijn
  2017-09-01  2:10 ` Alexei Starovoitov
  2017-09-01 12:44 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 8+ messages in thread
From: Willem de Bruijn @ 2017-08-31 21:00 UTC
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Documentation for this feature was missing from the patchset.
Copied a lot from the netdev 2.1 paper, addressing some small
interface changes since then.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 Documentation/networking/msg_zerocopy.rst | 254 ++++++++++++++++++++++++++++++
 1 file changed, 254 insertions(+)
 create mode 100644 Documentation/networking/msg_zerocopy.rst

diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
new file mode 100644
index 000000000000..2e2a3410b947
--- /dev/null
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -0,0 +1,254 @@
+
+============
+MSG_ZEROCOPY
+============
+
+Intro
+=====
+
+The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
+The feature is currently implemented for TCP sockets.
+
+
+Opportunity and Caveats
+-----------------------
+
+Copying large buffers between user process and kernel can be
+expensive. Linux supports various interfaces that eschew copying,
+such as sendpage and splice. The MSG_ZEROCOPY flag extends the
+underlying copy avoidance mechanism to common socket send calls.
+
+Copy avoidance is not a free lunch. As implemented, with page pinning,
+it replaces per-byte copy cost with page accounting and completion
+notification overhead. As a result, MSG_ZEROCOPY is generally only
+effective for writes larger than roughly 10 KB.
+
+Page pinning also changes system call semantics. It temporarily shares
+the buffer between process and network stack. Unlike with copying, the
+process cannot immediately overwrite the buffer after system call
+return without possibly modifying the data in flight. Kernel integrity
+is not affected, but a buggy program can possibly corrupt its own data
+stream.
+
+The kernel returns a notification when it is safe to modify data.
+Converting an existing application to MSG_ZEROCOPY is therefore not
+always as trivial as just passing the flag.
+
+
+More Info
+---------
+
+Much of this document was derived from a longer paper presented at
+netdev 2.1. For more in-depth information see that paper and talk,
+the excellent reporting over at LWN.net or read the original code.
+
+  paper, slides, video
+    https://netdevconf.org/2.1/session.html?debruijn
+
+  LWN article
+    https://lwn.net/Articles/726917/
+
+  patchset
+    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
+    https://lwn.net/Articles/730010/
+    https://www.spinics.net/lists/netdev/msg447552.html
+
+
+Interface
+=========
+
+Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
+avoidance, but not the only one.
+
+Socket Setup
+------------
+
+The kernel is permissive when applications pass undefined flags to the
+send system call. By default it simply ignores these. To avoid enabling
+copy avoidance mode for legacy processes that accidentally pass this
+flag, a process must first signal intent by setting a socket option:
+
+::
+
+	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
+		error(1, errno, "setsockopt zerocopy");
+
+
+Transmission
+------------
+
+The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
+Pass the new flag.
+
+::
+
+	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
+	if (ret != sizeof(buf))
+		error(1, errno, "send");
+
+
+Mixing copy avoidance and copying
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many workloads have a mixture of large and small buffers. Because copy
+avoidance is more expensive than copying for small packets, the
+feature is implemented as a flag. It is safe to mix calls with the flag
+with those without.
+
+
+Notifications
+-------------
+
+The kernel has to notify the process when it is safe to reuse a
+previously passed buffer. It queues completion notifications on the
+socket error queue, akin to the transmit timestamping interface.
+
+The notification itself is a simple scalar value. Each socket
+maintains an internal 32-bit counter. Each send call with MSG_ZEROCOPY
+that successfully sends data increments the counter. The counter is
+not incremented on failure or if called with length zero.
+
+
+Notification Reception
+~~~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates the API. In the simplest case, each
+send syscall is followed by a poll and recvmsg on the error queue.
+
+Reading from the error queue is always a non-blocking operation. The
+poll call is there to block until an error is outstanding. It will set
+POLLERR in its output flags. That flag does not have to be set in the
+events field: errors are signaled unconditionally.
+
+::
+
+	pfd.fd = fd;
+	pfd.events = 0;
+	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
+		error(1, errno, "poll");
+
+	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+	if (ret == -1)
+		error(1, errno, "recvmsg");
+
+	read_notification(&msg);
+
+The example is for demonstration purposes only. In practice, it is more
+efficient to not wait for notifications, but read without blocking
+every couple of send calls.
+
+Notifications can be processed out of order with other operations on
+the socket. A socket that has an error queued would normally block
+other operations until the error is read. Zerocopy notifications have
+a zero error code, however, to not block send and recv calls.
+
+
+Notification Batching
+~~~~~~~~~~~~~~~~~~~~~
+
+Multiple outstanding packets can be read at once using the recvmmsg
+call. This is often not needed. In each message the kernel returns not
+a single value, but a range. It coalesces consecutive notifications
+while one is outstanding for reception on the error queue.
+
+When a new notification is about to be queued, it checks whether the
+new value extends the range of the notification at the tail of the
+queue. If so, it drops the new notification packet and instead increases
+the range upper value of the outstanding notification.
+
+For protocols that acknowledge data in-order, like TCP, each
+notification can be squashed into the previous one, so that no more
+than one notification is outstanding at any one point.
+
+Ordered delivery is the common case, but not guaranteed. Notifications
+may arrive out of order on retransmission and socket teardown.
+
+
+Notification Parsing
+~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates how to parse the control message: the
+read_notification() call in the previous snippet. A notification
+is encoded in the standard error format, sock_extended_err.
+
+The level and type fields in the control data are protocol family
+specific, IP_RECVERR or IPV6_RECVERR.
+
+Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. The errno is zero,
+as explained before, to avoid blocking read and write system calls on
+the socket.
+
+The 32-bit notification range is encoded as [ee_info, ee_data]. This
+range is inclusive. Other fields in the struct must be treated as
+undefined, except for ee_code, as discussed below.
+
+::
+
+	struct sock_extended_err *serr;
+	struct cmsghdr *cm;
+
+	cm = CMSG_FIRSTHDR(msg);
+	if (cm->cmsg_level != SOL_IP ||
+	    cm->cmsg_type != IP_RECVERR)
+		error(1, 0, "cmsg");
+
+	serr = (void *) CMSG_DATA(cm);
+	if (serr->ee_errno != 0 ||
+	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+		error(1, 0, "serr");
+
+	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
+
+
+Deferred copies
+~~~~~~~~~~~~~~~
+
+Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
+avoidance, and a contract that the kernel will queue a completion
+notification. It is not a guarantee that the copy is elided.
+
+Copy avoidance is not always feasible. Devices that do not support
+scatter-gather I/O cannot send packets made up of kernel generated
+protocol headers plus zerocopy user data. A packet may need to be
+converted to having a private copy of data deep in the stack, say to
+compute a checksum.
+
+In all these cases, the kernel returns a completion notification when
+it releases its hold on the shared pages. The notification may arrive
+before the (copied) data is fully transmitted. A zerocopy completion
+notification is not a transmit completion notification, therefore.
+
+Deferred copies can be more expensive than a copy immediately in the
+system call, if the data is no longer warm in the cache. The process
+also incurs notification processing cost for no benefit. For this
+reason, the kernel signals if data was completed with a copy, by
+setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
+A process may use this signal to stop passing flag MSG_ZEROCOPY on
+subsequent requests on the same socket.
+
+Implementation
+==============
+
+Loopback
+--------
+
+Data sent to local sockets can be queued indefinitely if the receive
+process does not read its socket. Unbounded notification latency is
+not acceptable. For this reason all packets generated with MSG_ZEROCOPY
+that are looped to a local socket will incur a deferred copy. This
+includes looping onto packet sockets (e.g., tcpdump) and tun devices.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+tools/testing/selftests/net/msg_zerocopy.c.
+
+Be cognizant of the loopback constraint. The test can be run between
+a pair of hosts. But if run between a local pair of processes, for
+instance when run with msg_zerocopy.sh between a veth pair across
+namespaces, the test will not show any improvement. For testing, the
+loopback restriction can be temporarily avoided by making
+skb_orphan_frags_rx identical to skb_orphan_frags.
+
-- 
2.14.1.581.gf28d330327-goog

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
  2017-08-31 21:00 [PATCH net-next] doc: document MSG_ZEROCOPY Willem de Bruijn
@ 2017-09-01  2:10 ` Alexei Starovoitov
  2017-09-01  3:04   ` Willem de Bruijn
  2017-09-01 12:44 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 8+ messages in thread
From: Alexei Starovoitov @ 2017-09-01  2:10 UTC
  To: Willem de Bruijn; +Cc: netdev, davem, Willem de Bruijn

On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Documentation for this feature was missing from the patchset.
> Copied a lot from the netdev 2.1 paper, addressing some small
> interface changes since then.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
...
> +Notification Batching
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Multiple outstanding packets can be read at once using the recvmmsg
> +call. This is often not needed. In each message the kernel returns not
> +a single value, but a range. It coalesces consecutive notifications
> +while one is outstanding for reception on the error queue.
> +
> +When a new notification is about to be queued, it checks whether the
> +new value extends the range of the notification at the tail of the
> +queue. If so, it drops the new notification packet and instead increases
> +the range upper value of the outstanding notification.

Would it make sense to mention that max notification range is 32-bit?
So each 4Gbyte of xmit bytes there will be a notification.
In modern 40Gbps NICs it's not a lot. Means that there will be
at least one notification every second.
Or I misread the code?

Thanks for the doc!

Acked-by: Alexei Starovoitov <ast@kernel.org>

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
  2017-09-01  2:10 ` Alexei Starovoitov
@ 2017-09-01  3:04   ` Willem de Bruijn
  2017-09-01  3:10     ` Alexei Starovoitov
  0 siblings, 1 reply; 8+ messages in thread
From: Willem de Bruijn @ 2017-09-01  3:04 UTC
  To: Alexei Starovoitov; +Cc: Network Development, David Miller, Willem de Bruijn

On Thu, Aug 31, 2017 at 10:10 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Documentation for this feature was missing from the patchset.
>> Copied a lot from the netdev 2.1 paper, addressing some small
>> interface changes since then.
>>
>> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ...
>> +Notification Batching
>> +~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Multiple outstanding packets can be read at once using the recvmmsg
>> +call. This is often not needed. In each message the kernel returns not
>> +a single value, but a range. It coalesces consecutive notifications
>> +while one is outstanding for reception on the error queue.
>> +
>> +When a new notification is about to be queued, it checks whether the
>> +new value extends the range of the notification at the tail of the
>> +queue. If so, it drops the new notification packet and instead increases
>> +the range upper value of the outstanding notification.
>
> Would it make sense to mention that max notification range is 32-bit?
> So each 4Gbyte of xmit bytes there will be a notification.
> In modern 40Gbps NICs it's not a lot. Means that there will be
> at least one notification every second.
> Or I misread the code?

You're right. The doc does mention that the counter and range
are 32-bit. I can state more explicitly that that bounds the working
set size to 4GB. Do you expect this to be problematic? Processing
a single notification per 4GB of data should not be a significant
cost in itself.

> Thanks for the doc!

Thanks for reviewing :)

>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
>

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
  2017-09-01  3:04   ` Willem de Bruijn
@ 2017-09-01  3:10     ` Alexei Starovoitov
  2017-09-01  3:31       ` Willem de Bruijn
  2017-09-01 15:47       ` Willem de Bruijn
  0 siblings, 2 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2017-09-01  3:10 UTC
  To: Willem de Bruijn; +Cc: Network Development, David Miller, Willem de Bruijn

On Thu, Aug 31, 2017 at 11:04:41PM -0400, Willem de Bruijn wrote:
> On Thu, Aug 31, 2017 at 10:10 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
> >> From: Willem de Bruijn <willemb@google.com>
> >>
> >> Documentation for this feature was missing from the patchset.
> >> Copied a lot from the netdev 2.1 paper, addressing some small
> >> interface changes since then.
> >>
> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ...
> >> +Notification Batching
> >> +~~~~~~~~~~~~~~~~~~~~~
> >> +
> >> +Multiple outstanding packets can be read at once using the recvmmsg
> >> +call. This is often not needed. In each message the kernel returns not
> >> +a single value, but a range. It coalesces consecutive notifications
> >> +while one is outstanding for reception on the error queue.
> >> +
> >> +When a new notification is about to be queued, it checks whether the
> >> +new value extends the range of the notification at the tail of the
> >> +queue. If so, it drops the new notification packet and instead increases
> >> +the range upper value of the outstanding notification.
> >
> > Would it make sense to mention that max notification range is 32-bit?
> > So each 4Gbyte of xmit bytes there will be a notification.
> > In modern 40Gbps NICs it's not a lot. Means that there will be
> > at least one notification every second.
> > Or I misread the code?
> 
> You're right. The doc does mention that the counter and range
> are 32-bit. I can state more explicitly that that bounds the working
> set size to 4GB. Do you expect this to be problematic? Processing
> a single notification per 4GB of data should not be a significant
> cost in itself.

I think 4GB is fine. Just there was an idea that in cases when
notification of transmission can be known by other means the user space
could have skipped reading errqueue completely, but looks like it
still needs to poll. That's fine.

> > Thanks for the doc!
> 
> Thanks for reviewing :)
> 
> >
> > Acked-by: Alexei Starovoitov <ast@kernel.org>
> >

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
  2017-09-01  3:10     ` Alexei Starovoitov
@ 2017-09-01  3:31       ` Willem de Bruijn
  2017-09-01 15:47       ` Willem de Bruijn
  1 sibling, 0 replies; 8+ messages in thread
From: Willem de Bruijn @ 2017-09-01  3:31 UTC
  To: Alexei Starovoitov; +Cc: Network Development, David Miller, Willem de Bruijn

On Thu, Aug 31, 2017 at 11:10 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 11:04:41PM -0400, Willem de Bruijn wrote:
>> On Thu, Aug 31, 2017 at 10:10 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
>> >> From: Willem de Bruijn <willemb@google.com>
>> >>
>> >> Documentation for this feature was missing from the patchset.
>> >> Copied a lot from the netdev 2.1 paper, addressing some small
>> >> interface changes since then.
>> >>
>> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
>> > ...
>> >> +Notification Batching
>> >> +~~~~~~~~~~~~~~~~~~~~~
>> >> +
>> >> +Multiple outstanding packets can be read at once using the recvmmsg
>> >> +call. This is often not needed. In each message the kernel returns not
>> >> +a single value, but a range. It coalesces consecutive notifications
>> >> +while one is outstanding for reception on the error queue.
>> >> +
>> >> +When a new notification is about to be queued, it checks whether the
>> >> +new value extends the range of the notification at the tail of the
>> >> +queue. If so, it drops the new notification packet and instead increases
>> >> +the range upper value of the outstanding notification.
>> >
>> > Would it make sense to mention that max notification range is 32-bit?
>> > So each 4Gbyte of xmit bytes there will be a notification.
>> > In modern 40Gbps NICs it's not a lot. Means that there will be
>> > at least one notification every second.
>> > Or I misread the code?
>>
>> You're right. The doc does mention that the counter and range
>> are 32-bit. I can state more explicitly that that bounds the working
>> set size to 4GB. Do you expect this to be problematic? Processing
>> a single notification per 4GB of data should not be a significant
>> cost in itself.
>
> I think 4GB is fine. Just there was an idea that in cases when
> notification of transmission can be known by other means

Some kind of unspoofable response from the peer (i.e., not just
a tcp ack), or a kernel mechanism independent from the error
queue? The first does not guarantee that a retransmit is
not in progress.

> the user space
> could have skipped reading errqueue completely, but looks like it
> still needs to poll.

If a process has no need to see the notification, say because
it is sending out a buffer that is constant for the process lifetime,
then it could conceivably skip the recv, and poll with it. The code
as written will not coalesce more than 4GB of data, but that could
be revised.
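
For a process that does read its notifications, draining the error
queue opportunistically every few send calls keeps the cost low. A
minimal sketch, assuming a TCP socket on which SO_ZEROCOPY is already
set. The names zc_stats, zc_account() and zc_drain() are made up for
illustration, and the fallback defines only mirror the uapi values in
case the installed headers predate the feature.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

#ifndef SO_EE_ORIGIN_ZEROCOPY
#define SO_EE_ORIGIN_ZEROCOPY		5
#endif
#ifndef SO_EE_CODE_ZEROCOPY_COPIED
#define SO_EE_CODE_ZEROCOPY_COPIED	1
#endif

/* Hypothetical bookkeeping: send calls completed, and how many were copied. */
struct zc_stats {
	uint32_t completed;
	uint32_t copied;
};

/* Fold one notification into the stats. Returns -1 if it is not a
 * zerocopy completion. The range [ee_info, ee_data] is inclusive.
 */
static int zc_account(const struct sock_extended_err *serr,
		      struct zc_stats *st)
{
	uint32_t n;

	if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		return -1;

	n = serr->ee_data - serr->ee_info + 1;
	st->completed += n;
	if (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)
		st->copied += n;	/* kernel fell back to a deferred copy */
	return 0;
}

/* Read everything currently queued. Error queue reads never block, so
 * this returns as soon as the queue is empty. Returns the number of
 * notification messages read, or -1 on an unexpected error.
 */
static int zc_drain(int fd, struct zc_stats *st)
{
	union {
		char buf[128];
		struct cmsghdr align;	/* ensure suitable alignment */
	} control;
	struct msghdr msg;
	struct cmsghdr *cm;
	int nread = 0;

	for (;;) {
		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
			return errno == EAGAIN ? nread : -1;

		cm = CMSG_FIRSTHDR(&msg);
		if (!cm)
			return -1;
		if (!(cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) &&
		    !(cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR))
			return -1;
		if (zc_account((void *)CMSG_DATA(cm), st))
			return -1;
		nread++;
	}
}
```

A caller that sees st.copied grow could stop passing MSG_ZEROCOPY on
that socket, as the doc suggests for the deferred copy case.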

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
  2017-08-31 21:00 [PATCH net-next] doc: document MSG_ZEROCOPY Willem de Bruijn
  2017-09-01  2:10 ` Alexei Starovoitov
@ 2017-09-01 12:44 ` Jesper Dangaard Brouer
  2017-09-01 13:37   ` Willem de Bruijn
  1 sibling, 1 reply; 8+ messages in thread
From: Jesper Dangaard Brouer @ 2017-09-01 12:44 UTC
  To: Willem de Bruijn; +Cc: brouer, netdev, davem, Willem de Bruijn

On Thu, 31 Aug 2017 17:00:13 -0400
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> +More Info
> +---------
> +
> +Much of this document was derived from a longer paper presented at
> +netdev 2.1. For more in-depth information see that paper and talk,
> +the excellent reporting over at LWN.net or read the original code.
> +
> +  paper, slides, video
> +    https://netdevconf.org/2.1/session.html?debruijn
> +
> +  LWN article
> +    https://lwn.net/Articles/726917/
> +
> +  patchset
> +    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
> +    https://lwn.net/Articles/730010/
> +    https://www.spinics.net/lists/netdev/msg447552.html

IMHO it would be better to use the type of link also used in the
git log.  If you look at the kernel git log, the "Link:" tags have
the form: http://lkml.kernel.org/r/
And you can simply append the "Message-Id:" email header.

In this case the link would be:
  http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
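
The rule above is mechanical enough to script. A small sketch, where
lkml_link() is a hypothetical helper name and the only assumption is
that the header value may or may not still carry its angle brackets:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "http://lkml.kernel.org/r/<message-id>" into out.
 * Accepts the Message-Id with or without surrounding angle brackets.
 * Returns 0 on success, -1 if the result does not fit in outlen bytes.
 */
static int lkml_link(const char *msgid, char *out, size_t outlen)
{
	size_t len = strlen(msgid);
	int n;

	/* Drop the <...> wrapper used in the raw email header. */
	if (len >= 2 && msgid[0] == '<' && msgid[len - 1] == '>') {
		msgid++;
		len -= 2;
	}

	n = snprintf(out, outlen, "http://lkml.kernel.org/r/%.*s",
		     (int)len, msgid);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```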
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
  2017-09-01 12:44 ` Jesper Dangaard Brouer
@ 2017-09-01 13:37   ` Willem de Bruijn
  0 siblings, 0 replies; 8+ messages in thread
From: Willem de Bruijn @ 2017-09-01 13:37 UTC
  To: Jesper Dangaard Brouer
  Cc: Network Development, David Miller, Willem de Bruijn

On Fri, Sep 1, 2017 at 8:44 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu, 31 Aug 2017 17:00:13 -0400
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
>> +More Info
>> +---------
>> +
>> +Much of this document was derived from a longer paper presented at
>> +netdev 2.1. For more in-depth information see that paper and talk,
>> +the excellent reporting over at LWN.net or read the original code.
>> +
>> +  paper, slides, video
>> +    https://netdevconf.org/2.1/session.html?debruijn
>> +
>> +  LWN article
>> +    https://lwn.net/Articles/726917/
>> +
>> +  patchset
>> +    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
>> +    https://lwn.net/Articles/730010/
>> +    https://www.spinics.net/lists/netdev/msg447552.html
>
> IMHO it would be better to use the type of link also used in the
> git log.  If you look at the kernel git log, the "Link:" tags have
> the form: http://lkml.kernel.org/r/
> And you can simply append the "Message-Id:" email header.
>
> In this case the link would be:
>   http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com

I was not aware of that. Thanks, I'll send a v2.

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
  2017-09-01  3:10     ` Alexei Starovoitov
  2017-09-01  3:31       ` Willem de Bruijn
@ 2017-09-01 15:47       ` Willem de Bruijn
  1 sibling, 0 replies; 8+ messages in thread
From: Willem de Bruijn @ 2017-09-01 15:47 UTC
  To: Alexei Starovoitov; +Cc: Network Development, David Miller, Willem de Bruijn

On Thu, Aug 31, 2017 at 11:10 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 11:04:41PM -0400, Willem de Bruijn wrote:
>> On Thu, Aug 31, 2017 at 10:10 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
>> >> From: Willem de Bruijn <willemb@google.com>
>> >>
>> >> Documentation for this feature was missing from the patchset.
>> >> Copied a lot from the netdev 2.1 paper, addressing some small
>> >> interface changes since then.
>> >>
>> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
>> > ...
>> >> +Notification Batching
>> >> +~~~~~~~~~~~~~~~~~~~~~
>> >> +
>> >> +Multiple outstanding packets can be read at once using the recvmmsg
>> >> +call. This is often not needed. In each message the kernel returns not
>> >> +a single value, but a range. It coalesces consecutive notifications
>> >> +while one is outstanding for reception on the error queue.
>> >> +
>> >> +When a new notification is about to be queued, it checks whether the
>> >> +new value extends the range of the notification at the tail of the
>> >> +queue. If so, it drops the new notification packet and instead increases
>> >> +the range upper value of the outstanding notification.
>> >
>> > Would it make sense to mention that max notification range is 32-bit?
>> > So each 4Gbyte of xmit bytes there will be a notification.
>> > In modern 40Gbps NICs it's not a lot. Means that there will be
>> > at least one notification every second.
>> > Or I misread the code?
>>
>> You're right. The doc does mention that the counter and range
>> are 32-bit. I can state more explicitly that that bounds the working
>> set size to 4GB. Do you expect this to be problematic? Processing
>> a single notification per 4GB of data should not be a significant
>> cost in itself.

Actually, the counter is not a byte counter. It is incremented on each
system call that sends data with MSG_ZEROCOPY. So the 4GB limit
would only hold if a caller sends single byte requests at a time.

I will make this more clear in v2.
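
To illustrate the per-call semantics, here is a rough userspace sketch
of how a sender might map counter values back to buffers. It assumes
the first successful MSG_ZEROCOPY send on a socket gets id 0 and each
later one gets the next value; zc_tracker, zc_track_send() and
zc_complete() are hypothetical names, and MAX_INFLIGHT is an arbitrary
cap the caller would have to enforce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Must be a power of two so slot numbering stays contiguous when the
 * 32-bit id wraps from 0xffffffff to 0.
 */
#define MAX_INFLIGHT 64

struct zc_tracker {
	void *bufs[MAX_INFLIGHT];	/* buffer per id, slot = id % MAX_INFLIGHT */
	uint32_t next_id;		/* mirrors the per-socket call counter */
	uint32_t inflight;		/* sends not yet reported complete */
};

/* Call once per send(..., MSG_ZEROCOPY) that sent a non-zero length.
 * Returns the id that the completion notification will report.
 */
static uint32_t zc_track_send(struct zc_tracker *t, void *buf)
{
	uint32_t id = t->next_id++;

	assert(t->inflight < MAX_INFLIGHT);
	t->bufs[id % MAX_INFLIGHT] = buf;
	t->inflight++;
	return id;
}

/* Handle one notification covering the inclusive range [lo, hi].
 * Returns how many buffers were released, or -1 if the range is wider
 * than this tracker can hold.
 */
static int zc_complete(struct zc_tracker *t, uint32_t lo, uint32_t hi)
{
	uint32_t n = hi - lo + 1;	/* unsigned math copes with wraparound */
	uint32_t i;
	int released = 0;

	if (n > MAX_INFLIGHT)
		return -1;

	for (i = 0; i < n; i++) {
		uint32_t slot = (lo + i) % MAX_INFLIGHT;

		if (t->bufs[slot]) {
			/* The buffer may now be modified, reused or freed. */
			t->bufs[slot] = NULL;
			t->inflight--;
			released++;
		}
	}
	return released;
}
```

Since ids count calls rather than bytes, a 32-bit range covers far more
than 4GB in practice: at, say, 10 KB per call, 2^32 calls are roughly
43 TB, or about 2.4 hours of traffic at 40Gbps.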

>
> I think 4GB is fine. Just there was an idea that in cases when
> notification of transmission can be known by other means the user space
> could have skipped reading errqueue completely, but looks like it
> still needs to poll. That's fine.
>> > Thanks for the doc!
>>
>> Thanks for reviewing :)
>>
>> >
>> > Acked-by: Alexei Starovoitov <ast@kernel.org>
>> >
