[0/3] net: Kill skb_copy_datagram_const_iovec
mbox series

Message ID 20141103053751.GA27845@gondor.apana.org.au
Headers show
Series
  • net: Kill skb_copy_datagram_const_iovec
Related show

Message

Herbert Xu Nov. 3, 2014, 5:37 a.m. UTC
On Mon, Nov 03, 2014 at 12:45:03AM +0000, Al Viro wrote:
>
> Note, BTW, that there's a damn good reason to convert the socket side of
> things to iov_iter - as it is, ->splice_write() there is basically done with
> page-by-page mapping and doing kernel_sendmsg(); being able to deal with
> "map and copy" stuff *inside* ->sendmsg() would not only reduce the overhead,
> it would allow to get rid of ->sendpage() completely.  Basically, let
> ->sendmsg() instances check the iov_iter type and play zerocopy games if
> it's an "array of kernel pages" kind.  Compare ->sendpage() and ->sendmsg()
> instances for the protocols that have nontrivial ->sendpage(); you'll see
> that there's a lot of duplication.  Merging them looks very feasible, with
> divergence happening only very deep in the call chain.

Honestly I don't really care which way we end up going as long as
we pick one solution and stick with it.  Right now we have an
abomination in the form of skb_copy_datagram_const_iovec which is
the worst of both worlds, plus it duplicates tons of code.

So here's a few patches to kill this crap.

Cheers,

Comments

David Miller Nov. 3, 2014, 8:05 p.m. UTC | #1
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 3 Nov 2014 13:37:51 +0800

> On Mon, Nov 03, 2014 at 12:45:03AM +0000, Al Viro wrote:
>>
>> Note, BTW, that there's a damn good reason to convert the socket side of
>> things to iov_iter - as it is, ->splice_write() there is basically done with
>> page-by-page mapping and doing kernel_sendmsg(); being able to deal with
>> "map and copy" stuff *inside* ->sendmsg() would not only reduce the overhead,
>> it would allow to get rid of ->sendpage() completely.  Basically, let
>> ->sendmsg() instances check the iov_iter type and play zerocopy games if
>> it's an "array of kernel pages" kind.  Compare ->sendpage() and ->sendmsg()
>> instances for the protocols that have nontrivial ->sendpage(); you'll see
>> that there's a lot of duplication.  Merging them looks very feasible, with
>> divergence happening only very deep in the call chain.
> 
> Honestly I don't really care which way we end up going as long as
> we pick one solution and stick with it.  Right now we have an
> abomination in the form of skb_copy_datagram_const_iovec which is
> the worst of both worlds, plus it duplicates tons of code.
> 
> So here's a few patches to kill this crap.

To pick one direction and go with it, I totally agree with.

But a patch set like this as an interim solution, I am not so happy
with.

If the method says const, we have a contract with the caller to not
modify the iovec.  That caller can assume that we have not done so.

So this patch set violated that contract and can result in real bugs
either now or in the future.

I'll see if I can make some progress converting the networking over
to iov_iter.  It can't be that difficult... albeit perhaps a little
time consuming.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Herbert Xu Nov. 4, 2014, 3:38 a.m. UTC | #2
On Mon, Nov 03, 2014 at 03:05:53PM -0500, David Miller wrote:
>
> I'll see if I can make some progress converting the networking over
> to iov_iter.  It can't be that difficult... albeit perhaps a little
> time consuming.

OK great.  I'll try to convert tun/macvtap over to iov_iter.

Thanks,
Al Viro Nov. 4, 2014, 5:45 a.m. UTC | #3
On Mon, Nov 03, 2014 at 03:05:53PM -0500, David Miller wrote:

> I'll see if I can make some progress converting the networking over
> to iov_iter.  It can't be that difficult... albeit perhaps a little
> time consuming.

FWIW, I have a queue that got started back in April; basically, the plan
of attack was
	* separate kernel-side and userland msghdr.
	* localize ->msg_iov uses - most of that gets taken care of by
several new helpers, as in
    new helper: skb_copy_datagram_msg()
    
    Absolute majority of skb_copy_datagram_iovec() callers (49 out of 56)
    are passing it msg->msg_iov as iovec.  Provide a trivial wrapper that
    takes msg as argument instead of iovec.
and several like that (the numbers in the above are probably incorrect these
days - it was done more than half a year ago).
	* switch kernel-side msghdr to iov_iter.  That means diverging
layouts; it's really not hard, since we have copying of msghdr from
userland already localized.  Initially - just a mechanical conversion
(i.e. direct uses of iov_iter fields instead of ->msg_iov/->msg_iovlen;
note that after the introduction of wrappers the number of such places
is very much reduced).
	* start converting those relatively few places to iov_iter primitives.

And that's where it got stalled, since we have to deal with expectations
of callers.  Syscall ones are trivial; that's not a problem.  Unfortunately,
there are kernel_{send,recv}msg() users, and those do care about the state the
iovec is left in.  Strictly speaking, the state of iovec after e.g.
->sendmsg() is undefined.  And it's not just protocol-dependent - unless
I'm seriously misreading it, tcp_sendmsg() ends up modifying iovec in case
when it hits tcp_send_rcvq(), while in the normal case it leaves iovec
unmodified.  So in general you need to feed ->{send,recv}msg() a throwaway copy
of iovec.  Leads to wonders like
        /* NB we can't trust socket ops to either consume our iovs
         * or leave them alone. */
        LASSERT (niov > 0);

        for (nob = i = 0; i < niov; i++) {
                scratchiov[i] = iov[i];
                nob += scratchiov[i].iov_len;
        }
        LASSERT (nob <= conn->ksnc_rx_nob_wanted);

        rc = kernel_recvmsg(conn->ksnc_sock, &msg,
                (struct kvec *)scratchiov, niov, nob, MSG_DONTWAIT);
etc.  However, there are places that don't bother and do this:
        while (total_rx < data) {
                rx_loop = kernel_recvmsg(conn->sock, &msg, iov_p, iov_len,
                                        (data - total_rx), MSG_WAITALL);
                if (rx_loop <= 0) {
                        pr_debug("rx_loop: %d total_rx: %d\n",
                                rx_loop, total_rx); 
                        return rx_loop;
                }
                total_rx += rx_loop;
                pr_debug("rx_loop: %d, total_rx: %d, data: %d\n",
                                rx_loop, total_rx, data);
        }
(that's iscsit_do_rx_data()).  Maybe it's a bug; maybe it's relying on
specific behaviour of the protocol known to be used - this code clearly
expects recvmsg to advance iovec, which seems to depend only on the
protocol.  At the moment.  In any case, it's very brittle...

Hell knows; I hadn't finished digging through that zoo - got sidetracked back
then.  *IF* all such places either use a throwaway copy or assume that iovec
gets modified, we can do the following: switch the access to iovecs to
iov_iter primitives, with the first kind of callers creating a throwaway
iov_iter and the second just feeding the same iov_iter to e.g.
kernel_recvmsg().  iovec will remain constant, iov_iter will be advanced.
Moreover, in a lot of cases of first kind will be able to get rid of
throwaway iov_iter (and of manually advancing it), effectively converting
to the second one.

If we have places that currently rely on iovec remaining unchanged (i.e.
manually advancing it after kernel_{send,recv}msg()), the series will be
more painful ;-/  I very much hope that no such places exist...

FWIW, there is also a tactical question that needs to be dealt with.  We
can, of course, start with renaming the "kernel-side" (i.e. post
copy_msghdr_from_user()/get_compat_msghdr()) to struct kmsghdr.  OTOH,
that's a _lot_ of churn for very little reason - most of the instances
in the tree are of that kind.  So I did it the other way round - introduced
struct user_msghdr (only in linux/socket.h; note that we do *not* have
struct msghdr in uabi/linux/socket.h, or anywhere else in uabi/*),
made the syscalls take pointers to it and (initially) rely upon the identical
layouts in copy_msghdr_from_user(); once we put iov_iter into kernel-side
msghdr, we'll just do it like get_compat_msghdr() does.

Is that acceptable?  It would greatly reduce the amount of churn in net/* -
we don't need to pass iov_iter separately and most of the functions in
the middle of call chains are completely unchanged.  Only the originators
of ->sendmsg()/->recvmsg() and the places doing actual data copying
need to be touched.  OTOH, it makes for kernel struct msghdr looking
odd - instead of normal ->msg_iov and ->msg_iovlen it would have
->msg_iov_iter, with ->sendmsg()/->recvmsg() callers needing to set it
up...  OTTH, the things *are* odd from userland programmer POV - sendmsg(2)
and recvmsg(2) leave the iovec unchanged, and having it changed unpredicatably
in the kernel-side counterparts seems to make for a nasty trap.  Certainly
makes for a bunch of nasty comments in the code using those...

Comments?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Al Viro Nov. 5, 2014, 1:53 a.m. UTC | #4
On Tue, Nov 04, 2014 at 05:45:13AM +0000, Al Viro wrote:

> Hell knows; I hadn't finished digging through that zoo - got sidetracked back
> then.  *IF* all such places either use a throwaway copy or assume that iovec
> gets modified, we can do the following: switch the access to iovecs to
> iov_iter primitives, with the first kind of callers creating a throwaway
> iov_iter and the second just feeding the same iov_iter to e.g.
> kernel_recvmsg().  iovec will remain constant, iov_iter will be advanced.
> Moreover, in a lot of cases of first kind will be able to get rid of
> throwaway iov_iter (and of manually advancing it), effectively converting
> to the second one.

All right, now I _have_ finished that.  See the resulting notes below.
TL;DR version: looks like hypothesis above is correct, modulo 2 places,
both buggy - cifs smb_send_kvec() apparently relies on ->sendmsg() leaving the
iovec unchanged and so does of the ceph_tcp_sendmsg() callers
(write_partial_kvec()).  For TCP that's not always true.  Another apparent
bug caught in process is iscsi iscsit_do_tx_data() - assumes that iovec is
being consumed by sendmsg().  I don't see how that could not be a bug - TCP
sockets can get there and tcp_sendmsg() normally *doesn't* modify the iovec.
Sometimes it does, unfortunately for other two places...

Maintainers Cc'd...


Full version follows:
-----------------------------------------------------------------------------
->sendmsg(): there's such method in struct proto_ops and in struct proto; the
latter is called by (some of) the former, in cases when ->sendmsg() isn't the
same for the entire family.

Instances of proto ->sendmsg() are, on several occasions, called directly; some
of those calls are from another such instance (with unchanged payload).  There
are two exceptions to that - in tipc_accept() and tipc_connect() we call such
instances with empty payload.  All calls via method are from proto_ops
->sendmsg() instances, payload unchanged.

Instances of proto_ops ->sendmsg() are almost never called directly.  All
exceptions are from another such instance with unchanged payload.

There are two places that call proto_ops ->sendmsg() via method -
__sock_sendmsg_nosec() in net/socket.c, and handle_tx() in drivers/vhost/net.c.
The latter appears to be playing somewhat unusual games with passing NULL iocb,
making it impossible to use the former...  Everybody else in the kernel goes
through __sock_sendmsg_nosec(), though - it's a chokepoint for sendmsg path.

vhost callsite is somewhat worrying - granted, most of the ->sendmsg()
instances don't give a damn about iocb at all.  The rest, though...
E.g. what happens if we do VHOST_NET_SET_BACKEND with backend.fd being an
AF_UNIX socket?  AFAICS, if it ever gets to that ->sendmsg() call afterwards,
we'll get an oops when e.g. unix_dgram_sendmsg() calls kiocb_to_siocb(NULL).
The same goes for AF_NETLINK; AF_TIPC is even more interesting, since there
NULL iocb is used as "we are in weird callchain, socket is already locked"
flag (for tipc_{accept,connect}() callsites).  What's going on there?
[After having talked with mst: drivers/vhost/net.c checks that it's
AF_PACKET/SOCK_RAW (packet_sendmsg()) *or* comes from tun.c (tun_sendmsg())
or from macvtap.c (macvtap_sendmsg()) and all of those ignore iocb completely]

FWIW, the situation with iocb is
	* AF_UNIX and AF_NETLINK use it to get to sock_iocb
	* AF_TIPC uses it as a flag - it never dereferences the damn thing and
only compares it with NULL, to determine whether it's in a normal call chain,
or in tipc_accept/tipc_connect one...
	* everything else ignores it completely, either directly or after
passing it to something that ignores it (rxrpc has unusually deep chain, but it
still ends up ignoring the sucker).

->recvmsg(): again, present in proto_ops and proto, in the same relationship
to each other.  The situation is a bit simpler there: there is only one direct
caller of an instance of struct proto ->recvmsg() and it is in another such
instance, arguments unchanged.  All callers via method are in the instances of
struct proto_ops ->recvmsg(), message-related arguments unchanged.  No direct
callers of struct proto_ops recvmsg, three call sites via method - regular one
in __sock_recvmsg_nosec() plus two in drivers/vhost/net.c.  The latter have
NULL iocb and are saved from oopsing in ->recvmsg() by the same logics that
saves vhost on sendmsg side.  iocb is ignored by everything except AF_UNIX
and AF_NETLINK (those use it for sock_iocb) and neither can be reached from
vhost path.

So it boils down to the following: drivers/vhost/net.c aside, everything goes
through __sock_sendmsg_nosec() on the sendmsg side and __sock_recvmsg_nosec()
on recvmsg one.

Call chains leading to __sock_sendmsg_nosec():

__sock_sendmsg_nosec()
	<- sock_sendmsg_nosec()
		<- ___sys_sendmsg()
			<- __sys_sendmsg()
				<- sys_compat_sendmsg()
				<- sys_sendmsg()
			<- __sys_sendmmsg()
				<- sys_compat_senmmmsg()
				<- sys_sendmmsg()
	<- __sock_sendmsg()
		<- do_sock_write()
			<- sock_aio_write()
				== ->aio_write()
		<- sock_sendmsg()
			<- svc_sendto()	[no iovec at all]
			<- ___sys_sendmsg() [see above]
			<- sys_sendto()
			<- kernel_sendmsg()
All syscalls (and there's quite a tangled mess with sys_socketcall,
assorted ARM wrappers, etc.) end up with iovec discarded.
Ditto for ->aio_write() callers - they all free the iovec soon after
->aio_write() returns and never look at it before freeing.

Call chains leading to __sock_recvmsg_nosec():

__sock_recvmsg_nosec()
	<- sock_recvmsg_nosec()
		<- ___sys_recvmsg()
			<- __sys_recvmsg()
				<- sys_compat_recvmsg()
				<- sys_recvmsg()
			<- __sys_recvmmsg()
				<- sys_compat_recvmmsg()
				<- sys_recvmmsg()
	<- __sock_recvmsg()
		<- do_sock_read()
			<- sock_aio_read()
				== ->aio_read()
		<- sock_recvmsg() [why is it not static, BTW?]
			<- ___sys_recvmsg() [see above]
			<- sys_recvfrom()
			<- kernel_recvmsg()
Again, both the sycalls and ->aio_read() callers end up discarding
iovec.

All of that leaves us with kernel_{send,recv}msg() as the next-order
chokepoints.

kernel_recvmsg() callers:
	drbd drbd_recv_short() - single-element iovec discarded
	nbd sock_xmit() - single-element iovec discarded; would be better off
with advancing iov_iter.
	isdn l1oip_socket_thread() - single-element iovec discarded
	lustre ksocknal_lib_recv_iov() - iovec copied (with unhappy comment),
copy passed to kernel_recvmsg() and discarded
	lustre ksocknal_lib_recv_kiov() - ditto.  Would be much better off
with bvec-based iov_iter
	lustre libcfs_sock_read() - single-element iovec discarded; would be
better off with advancing iov_iter.
	iscsi iscsit_do_rx_data() - assumes that iovec is being consumed.
AFAICS, it's guaranteed to be TCP and tcp_recvmsg() appears to act that way,
so it's probably OK...
	usbip usbip_recv() - single-element iovec discarded; would be better
off with advancing iov_iter
	cifs cifs_readv_from_socket() - iovec copied, copy passed to
kernel_recvmsg() and discarded; *definitely* would be better off with advancing
iov_iter.
	dlm receive_from_sock() - 1- or 2-element iovec discarded.
	ncpfs _recv() - single-element iovec discarded.
	ocfs2 o2net_recv_tcp_msg() - single-element iovec discarded.
	ceph ceph_tcp_recvmsg() - single-element iovec discarded; at least one
of the loops using it would be better off with advancing iov_iter.
	ipvs ip_vs_receive() - single-element iovec discarded.
	sunrpc svc_udp_recvfrom() - no payload (MSG_PEEK, that one)
	tipc tipc_receive_from_sock() - single-element iovec discarded.
	sunrpc svc_recvfrom() - confusing; looks like iovec is a throwaway
one, though (and we might be better off if we could use an iov_iter of bvec
sort instead).

kernel_sendmsg() callers:
	drbd drbd_send() - single-element iovec discarded; would be better
off with advancing iov_iter
	nbd sock_xmit() - ditto
	isdn l1oip_socket_send() - single-element iovec discarded
	iscsi iscsi_sw_tcp_xmit_segment() - single-element iovec discarded
	lustre ksocknal_lib_send_iov() - iovec copied (with unhappy comment),
copy passed to kernel_sendmsg() and discarded
	lustre ksocknal_lib_send_kiov - ditto.  Would be much better off
with bvec iov_iter.
	lustre libcfs_sock_write() - single-element iovec discarded; better
off with advancing iov_iter
	iscsi iscsit_do_tx_data() - assumes that iovec is being consumed.
I don't see how that could not be a bug - TCP sockets can get there.
	usbip stub_send_ret_submit() - iovec is built and discarded;
short write is treated as an error
	usbip stub_send_ret_unlink() - ditto
	usbip vhci_send_cmd_submit() - ditto
	usbip vhci_send_cmd_unlink() - ditto
	cifs smb_send_kvec() - apparently relies on ->sendmsg() leaving the
iovec unchanged.  Looks like a bug - tcp_sendmsg() might drain iovec in some
cases.
	dlm sctp_send_shutdown() - no payload
	dlm sctp_init_assoc() - single-element iovec discarded
	ncpfs do_send() - single-element iovec discarded
	ocfs2 o2net_send_tcp_msg() - callers pass it a throwaway iovec
	bnep bnep_send() - single-element iovec discarded
	bnep_tx_frame() - short iovec is built and discarded
	cmtp_send_frame() - single-element iovec discarded
	hidp_send_frame() - single-element iovec discarded
	rfcomm_send_frame() - single-element iovec discarded
	rfcomm_send_test() - short iovec is built and discarded
	core sock_no_sendpage*() - single-element iovec discarded
	ipvs ip_vs_send_async() - single-element iovec discarded
	rds rds_tcp_sendmsg() - single-element iovec discarded
	rxrpc rxrpc_busy() - single-element iovec discarded
	rxrpc rxrpc_process_call() - short iovec is built and discarded
	rxrpc rxrpc_abort_connection() - short iovec is built and discarded
	rxrpc rxrpc_reject_packets() - assumes that sendmsg drains iovec;
may be a bug.
	rxrpc rxrpc_send_packet() - single-element iovec discarded
	rxrpc rxkad_issue_challenge() - short iovec is built and discarded
	rxrpc rxkad_send_response - short iovec is built and discarded
	sunrpc xs_send_kvec() - single-element iovec discarded; really asks
or iov_iter...
	tipc tipc_send_to_sock() - single-element iovec discarded; for some
reason it seems to believe that short writes never happen...
	ceph ceph_tcp_sendmsg() - one caller appears to discard iovec, another
(write_partial_kvec()) apparently assumes that iovec is unchanged by sendmsg.
Not guaranteed to be true for TCP, AFAICAS.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/