From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.6 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D4CC4C48BD5 for ; Tue, 25 Jun 2019 22:00:35 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 564EE2080C for ; Tue, 25 Jun 2019 22:00:35 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="i2vcwrv1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726354AbfFYWAe (ORCPT ); Tue, 25 Jun 2019 18:00:34 -0400 Received: from mail-ed1-f52.google.com ([209.85.208.52]:43599 "EHLO mail-ed1-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726287AbfFYWAe (ORCPT ); Tue, 25 Jun 2019 18:00:34 -0400 Received: by mail-ed1-f52.google.com with SMTP id e3so75264edr.10 for ; Tue, 25 Jun 2019 15:00:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=mT57ddXlrZxmLsqqECxhNnFtnWSBoCrk1TKndhdH9gc=; b=i2vcwrv1hOacz163jUaLlCOyB4XZK1aCcM0/B7ri0wDCxw+psaMBhTmD258Bgbbtxv dL2tsl2fBRNvFi6Vkt4HFJWVuRiCAMRHE8B0cowtWttHQD4wVScPwAHrP7M0WPCAehVr yWO91jRNedhf1FYPP3QDiKhLrInvH/EuYgK1mAeBFAarTgx/WxI6d7hrPIgMMOIlolVi xKEWS78NCEFDAvJfzi5RuLkqgpYSsR+lIMDqlOsHTi2x9/69gRNmWtsfAJL8DCrcQTqN etEsSwXbW/KK/OgZuhztz1qx4L+pnV/H/40klI/UQM+Y0p6OHUSyMmoXwT3bwJR/Yw1l pvWA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=mT57ddXlrZxmLsqqECxhNnFtnWSBoCrk1TKndhdH9gc=; b=pimTrp0DOITq6pElMHz901kgA+Kp8MNOJrdn5X0CNCTKdp243lgTafIRb3XUSibhpq ZV4lcDVZlPi3je7TXYMiQYL7OiD0ihhMLLm+BER3A9gYYpixfp9+PmKZKIhvuqjHqnSd 6yvnlj4vEosAlFBaIDpR8X258rSm0ta+43/0xrQSU4dW2kqdkpAIzh9/FXCXOcmbYFRV 3FEy8/x8GU5T+BZvaI3zH6/TEEGzQPLh0CThx0WmY32uTP+wXQ3p2cHFS2FfHGNdWmwZ 6OgpkIO8ZHkqVQwWDKnd+5HIf1d+jYrvKRwBtkCnOpLtWphPTjI+YRMoyfjrPPhjoSyx imCg== X-Gm-Message-State: APjAAAVVbCp2gnlDL2H/NiicsEUKnUtmOOjp7YvVUcL4InAZwuIbpQXZ WUrsOeW0pfo1O0gHbEIGt+tx1m+uSFL+45SoQTs= X-Google-Smtp-Source: APXvYqy8GsqojqaUcvznFdqluV8o/axKncshNFpjzK5TUDMbCOxM+HhBFrmM4SG/rdBLSsYc+9UQgoLKEBOnBuRU0TM= X-Received: by 2002:a17:906:2acf:: with SMTP id m15mr731221eje.31.1561500031589; Tue, 25 Jun 2019 15:00:31 -0700 (PDT) MIME-Version: 1.0 References: <20190619202533.4856-1-nhorman@tuxdriver.com> <20190624004604.25607-1-nhorman@tuxdriver.com> <20190624215142.GA8181@hmswarspite.think-freely.org> <20190625110247.GA29902@hmswarspite.think-freely.org> <20190625162028.GC29902@hmswarspite.think-freely.org> In-Reply-To: <20190625162028.GC29902@hmswarspite.think-freely.org> From: Willem de Bruijn Date: Tue, 25 Jun 2019 17:59:55 -0400 Message-ID: Subject: Re: [PATCH v3 net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET To: Neil Horman Cc: Network Development , Matteo Croce , "David S. Miller" Content-Type: text/plain; charset="UTF-8" Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Tue, Jun 25, 2019 at 12:20 PM Neil Horman wrote: > > On Tue, Jun 25, 2019 at 09:37:17AM -0400, Willem de Bruijn wrote: > > On Tue, Jun 25, 2019 at 7:03 AM Neil Horman wrote: > > > > > > On Mon, Jun 24, 2019 at 06:15:29PM -0400, Willem de Bruijn wrote: > > > > > > > + if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) { > > > > > > > + po->wait_on_complete = 1; > > > > > > > + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); > > > > > > > > > > > > This resets timeout on every loop. should only set above the loop once. > > > > > > > > > > > I explained exactly why I did that in the change log. Its because I reuse the > > > > > timeout variable to get the return value of the wait_for_complete call. > > > > > Otherwise I need to add additional data to the stack, which I don't want to do. > > > > > Sock_sndtimeo is an inline function and really doesn't add any overhead to this > > > > > path, so I see no reason not to reuse the variable. > > > > > > > > The issue isn't the reuse. It is that timeo is reset to sk_sndtimeo > > > > each time. Whereas wait_for_common and its variants return the > > > > number of jiffies left in case the loop needs to sleep again later. > > > > > > > > Reading sock_sndtimeo once and passing it to wait_.. repeatedly is a > > > > common pattern across the stack. > > > > > > > But those patterns are unique to those situations. For instance, in > > > tcp_sendmsg_locked, we aquire the value of the socket timeout, and use that to > > > wait for the entire message send operation to complete, which consists of > > > potentially several blocking operations (waiting for the tcp connection to be > > > established, waiting for socket memory, etc). In that situation we want to wait > > > for all of those operations to complete to send a single message, and fail if > > > they exceed the timeout in aggregate. The semantics are different with > > > AF_PACKET. In this use case, the message is in effect empty, and just used to > > > pass some control information. tpacket_snd, sends as many frames from the > > > memory mapped buffer as possible, and on each iteration we want to wait for the > > > specified timeout for those frames to complete sending. I think resetting the > > > timeout on each wait instance is the right way to go here. > > > > I disagree. If a SO_SNDTIMEO of a given time is set, the thread should > > not wait beyond that. Else what is the point of passing a specific > > duration in the syscall? > > > I can appreciate that, but you said yourself that you wanted to minimize this > change. The current behavior of AF_PACKET is to: > a) check for their being no more packets ready to send > b) if (a) is false we schedule() (nominally allowing other tasks to run) > c) when (b) is complete, check for additional frames to send, and exit if there > are not, otherwise, continue sending and waiting > > In that model, a single task calling sendmsg can run in the kernel indefinately, > if userspace continues to fill the memory mapped buffer > > Given that model, resetting the timeout on each call to wait_for_completion > keeps that behavior. In fact, if we allow the timeout value to get continuously > decremented, it will be possible for a call to sendmsg in this protocol to > _always_ return -ETIMEDOUT (if we keep the buffer full in userspace long enough, > then the sending task will eventually decrement timeo to zero, and force an > -ETIMEDOUT call). > > I'm convinced that resetting the timeout here is the right way to go. It upholds the contract, but extends when userspace appends to the ring. Okay, yes, that makes sense. > > Btw, we can always drop the timeo and go back to unconditional (bar > > signals) waiting. > > > We could, but it would be nice to implement the timeout feature here if > possible. > > > > > > > > > > > @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > > > > > > > err = net_xmit_errno(err); > > > > > > > if (err && __packet_get_status(po, ph) == > > > > > > > TP_STATUS_AVAILABLE) { > > > > > > > + /* re-init completion queue to avoid subsequent fallthrough > > > > > > > + * on a future thread calling wait_on_complete_interruptible_timeout > > > > > > > + */ > > > > > > > + po->wait_on_complete = 0; > > > > > > > > > > > > If setting where sleeping, no need for resetting if a failure happens > > > > > > between those blocks. > > > > > > > > > > > > > + init_completion(&po->skb_completion); > > > > > > > > > > > > no need to reinit between each use? > > > > > > > > > > > I explained exactly why I did this in the comment above. We have to set > > > > > wait_for_complete prior to calling transmit, so as to ensure that we call > > > > > wait_for_completion before we exit the loop. However, in this error case, we > > > > > exit the loop prior to calling wait_for_complete, so we need to reset the > > > > > completion variable and the wait_for_complete flag. Otherwise we will be in a > > > > > case where, on the next entrace to this loop we will have a completion variable > > > > > with completion->done > 0, meaning the next wait will be a fall through case, > > > > > which we don't want. > > > > > > > > By moving back to the point where schedule() is called, hopefully this > > > > complexity automatically goes away. Same as my comment to the line > > > > immediately above. > > > > > > > Its going to change what the complexity is, actually. I was looking at this > > > last night, and I realized that your assertion that we could remove > > > packet_next_frame came at a cost. This is because we need to determine if we > > > have to wait before we call po->xmit, but we need to actually do the wait after > > > po->xmit > > > > Why? The existing method using schedule() doesn't. > > > Because the existing method using schedule doesn't have to rely on an > asynchronous event to wake the sending task back up. Specifically, we need to > set some state that indicates tpacket_destruct_skb should call complete, and > since tpacket_destruct_skb is called asynchronously in a separate execution > context, we need to ensure there is no race condition whereby we execute > tpacket_destruct_skb before we set the state that we also use to determine if we > should call wait_for_complete. ie: > 1) tpacket_send_skb calls po->xmit > 2) tpacket_send_skb loops around and checks to see if there are more packets to > send. If not and if need_wait is set we call wait_for_complete > > If tpacket_destruct_skb is called after (2), we're fine. But if its called > between (1) and (2), then tpacket_destruct_skb won't call complete (because > wait_on_complete isn't set yet), causing a hang. > > Because you wanted to remove packet_next_frame, we have no way to determine if > we're done sending frames prior to calling po->xmit, which is the point past > which tpacket_destruct_skb might be called, so now I have to find an alternate > state condition to key off of. > > > > Can we not just loop while sending and sleep immediately when > > __packet_get_status returns !TP_STATUS_AVAILABLE? > > > > I don't understand the need to probe the next packet to send instead > > of the current. > > > See above, we can definately check the return of ph at the top of the loop, and > sleep if its NULL, but in so doing we cant use po->wait_on_complete as a gating > variable because we've already called po->xmit. Once we call that function, if > we haven't already set po->wait_on_complete to true, its possible > tpacket_destruct_skb will already have been called before we get to the point > where we check wait_for_completion. That means we will wait on a completion > variable that never gets completed, so I need to find a new way to track weather > or not we are waiting. Its entirely doable, I'm sure, its just not as straight > forward as your making it out to be. > > > This seems to be the crux of the disagreement. My guess is that it has > > to do with setting wait_on_complete, but I don't see the problem. It > > can be set immediately before going to sleep. > > > The reason has to do with the fact that > tpacket_destruct_skb can be called from ksoftirq context (i.e. it will be called > asynchrnously, not from the thread running in tpacket_snd). Condsider this > flow, assuming we do as you suggest, and just set po->wait_on_complete=1 > immediately prior to calling wait_for_complete_interruptible_timeout: > > 1) tpacket_snd gets a frame, builds the skb for transmission > 2) tpakcket_snd calls po->xmit(skb), sending the frame to the driver > 3) tpacket_snd, iterates through back to the top of the loop, and determines > that we need to wait for the previous packet to complete > 4) The driver gets a tx completion interrupt, interrupting the cpu on which we > are executing, and calls kfree_skb_irq on the skb we just transmitted > 5) the ksoftirq runs on the cpu, calling kfree_skb on our skb > 6) (5) calls skb->destruct (which is tpakcet_skb_destruct) > 7) tpacket_skb_destruct executes, sees that po->wait_on_completion is zero, and > skips calling complete() > 8) the thread running tpacket_snd gets scheduled back on the cpu and now sets > po->wait_on_complete, then calls wait_for_completion_interruptible_timeout, but > since the skb we are waiting to complete has already been freed, we will never > get the completion notification, and so we will wait for the maximal timeout, > which is absolutely not what we want. > > Interestingly (Ironically), that control flow will never happen in the use case > I'm trying to fix here, because its SCHED_FIFO, and will run until such time as > we call wait_for_completion_interuptible_timeout, so in this case we're safe. > But in the nominal case, where the sending task is acturally SCHED_OTHER, the > above can aboslutely happen, and thats very broken. > > > I don't meant to draw this out, btw, or add to your workload. If you > > prefer, I can take a stab at my (admittedly hand-wavy) suggestion. > > > No, I have another method in flight that I'm testing now, it removes the > po->wait_on_complete variable entirely, checking instead to make sure that > we've: > a) sent at least one frame > and > b) that we have a positive pending count > and > c) that we need to wait > and > d) that we have no more frames to send > > This is why I was saying that your idea can be done, but it trades one form of > complexity for another. Thanks for the detailed explanation of the races. I was thinking this through, but just see V4. Will take a look at that.