From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Neil Horman <nhorman@tuxdriver.com>
Cc: Network Development <netdev@vger.kernel.org>,
	Matteo Croce <mcroce@redhat.com>,
	"David S. Miller" <davem@davemloft.net>
Subject: Re: [PATCH net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET
Date: Thu, 20 Jun 2019 12:18:41 -0400
Message-ID: <CAF=yD-KtqhHfzRtMVm17f1gfZRuSacB1M-QBSP8dY5Kz_Cn+Yw@mail.gmail.com>
In-Reply-To: <20190620161411.GE18890@hmswarspite.think-freely.org>

On Thu, Jun 20, 2019 at 12:14 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > >
> > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > >
> > > > > When an application is run that:
> > > > > a) Sets its scheduler to be SCHED_FIFO
> > > > > and
> > > > > b) Opens a memory mapped AF_PACKET socket, and sends frames with the
> > > > > > MSG_DONTWAIT flag cleared, it's possible for the application to hang
> > > > > forever in the kernel.  This occurs because when waiting, the code in
> > > > > tpacket_snd calls schedule, which under normal circumstances allows
> > > > > other tasks to run, including ksoftirqd, which in some cases is
> > > > > responsible for freeing the transmitted skb (which in AF_PACKET calls a
> > > > > destructor that flips the status bit of the transmitted frame back to
> > > > > available, allowing the transmitting task to complete).
> > > > >
> > > > > However, when the calling application is SCHED_FIFO, its priority is
> > > > > such that the schedule call immediately places the task back on the cpu,
> > > > > preventing ksoftirqd from freeing the skb, which in turn prevents the
> > > > > transmitting task from detecting that the transmission is complete.
> > > > >
> > > > > We can fix this by converting the schedule call to a completion
> > > > > mechanism.  By using a completion queue, we force the calling task, when
> > > > > it detects there are no more frames to send, to schedule itself off the
> > > > > cpu until such time as the last transmitted skb is freed, allowing
> > > > > forward progress to be made.
> > > > >
> > > > > > Tested by myself and the reporter, with good results.
> > > > > >
> > > > > > Applies to the net tree.
> > > > >
> > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > Reported-by: Matteo Croce <mcroce@redhat.com>
> > > > > CC: "David S. Miller" <davem@davemloft.net>
> > > > > ---
> > > >
> > > > This is a complex change for a narrow configuration. Isn't a
> > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other
> > > > > networking workloads as well? And isn't the right configuration to
> > > > > always increase ksoftirqd priority when increasing another process's
> > > > priority? Also, even when ksoftirqd kicks in, isn't some progress
> > > > still made on the local_bh_enable reached from schedule()?
> > > >
> > >
> > > A few questions here to answer:
> >
> > Thanks for the detailed explanation.
> >
> Gladly.
>
> > > Regarding other protocols having this problem, that's not the case, because
> > > non-packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a
> > > period of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set).
> > > We could certainly do that, but the current implementation doesn't (opting
> > > instead to wait indefinitely until the respective packet(s) have transmitted
> > > or errored out), and I wanted to maintain that behavior.  If there is
> > > consensus that packet sockets should honor SNDTIMEO, then I can certainly do
> > > that.
> > >
> > > As for progress made by calling local_bh_enable, my read of the code doesn't
> > > have the scheduler calling local_bh_enable at all.  Instead schedule uses
> > > preempt_disable/preempt_enable_no_resched() to gain exclusive access to the cpu,
> > > which ignores pending softirqs on re-enablement.
> >
> > Ah, I'm mistaken there, then.
> >
> > >  Perhaps that needs to change,
> > > but I'm averse to making scheduler changes for this (the aforementioned concern
> > > about complex changes for a narrow use case).
> > >
> > > Regarding raising the priority of ksoftirqd, that could be a solution, but the
> > > priority would need to be raised to a high priority SCHED_FIFO parameter, and
> > > that gets back to making complex changes for a narrow problem domain.
> > >
> > > As for the complexity of the solution, I think this is, given your
> > > comments, the least complex and intrusive change to solve the given problem.
> >
> > Could it be simpler to ensure do_softirq() gets run here? That would
> > allow progress for this case.
> >
> I'm not sure.  On the surface, we certainly could do it, but inserting a call to
> do_softirq, either directly or indirectly through some other mechanism, seems
> like a non-obvious fix, and may lead to confusion down the road.  I'm hesitant
> to pursue such a solution without some evidence that it would be a better one.
>
> > >  We
> > > need to find a way to force the calling task off the cpu while the asynchronous
> > > operations in the transmit path complete, and we can do that this way, or by
> > > honoring SK_SNDTIMEO.  I'm fine with doing the latter, but I didn't want to
> > > alter the current protocol behavior without consensus on that.
> >
> > In general SCHED_FIFO is dangerous with regard to stalling other
> > progress, incl. ksoftirqd. But it does appear that this packet socket
> > case is special inside networking in calling schedule() directly here.
> >
> > If converting that, should it convert to logic more akin to other
> > sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
> > the pros and cons of completion here yet, sorry. Didn't want to delay
> > responding until after I get a chance.
> >
> That would be the solution described above (i.e. honoring SK_SNDTIMEO).
> Basically you call sock_send_waittimeo, which returns a timeout value (or 0 if
> MSG_DONTWAIT is set), then you block for that period of time waiting for
> transmit completion.

From an ABI point of view, starting to support SK_SNDTIMEO where it
currently is not implemented certainly seems fine.
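
For reference, a minimal userspace sketch of what setting that option
looks like. Note that at the setsockopt(2) level the option is spelled
SO_SNDTIMEO. The helper names set_send_timeout() and get_send_timeout()
are made up for illustration, and whether the mmap'd packet socket TX
path honors the value is exactly the open question in this thread.

```c
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper: set the send timeout on any socket fd.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */
int set_send_timeout(int fd, long sec, long usec)
{
	struct timeval tv = { .tv_sec = sec, .tv_usec = usec };

	return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

/*
 * Hypothetical helper: read the current send timeout back.
 * A zero timeval means "block indefinitely".
 */
int get_send_timeout(int fd, struct timeval *tv)
{
	socklen_t len = sizeof(*tv);

	return getsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, tv, &len);
}
```

The option lives at SOL_SOCKET, so the same call applies to a packet
socket fd as to any other socket; only the protocol's send path decides
whether the timeout is actually consulted.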

>  I'm happy to implement that solution, but I'd like to get
> some clarity as to whether there is a reason we don't currently honor that
> socket option before I change the behavior that way.
>
> Dave, do you have any insight into AF_PACKET history as to why we would ignore
> the send timeout socket option here?

On the point of calling schedule(): even if that is rare, there are a lot of
other cond_resched() calls that may have the same starvation issue
with SCHED_FIFO and ksoftirqd.
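
To make the contrast with a yield-and-repoll loop concrete, here is a
userspace analogue (pthreads, not kernel code) of the completion-style
wait proposed in the patch description: the waiter sleeps until it is
explicitly woken, so it cannot monopolize the cpu regardless of its
priority. All names here (completion_sketch_*, tx_done_thread) are
hypothetical; the kernel's own primitives are init_completion(),
wait_for_completion() and complete().

```c
#include <pthread.h>

/* Userspace stand-in for a kernel struct completion. */
struct completion_sketch {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	unsigned int    done;	/* number of completions not yet consumed */
};

void completion_sketch_init(struct completion_sketch *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

/* Sleep until a completion is posted, then consume it. */
void completion_sketch_wait(struct completion_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->done == 0)
		pthread_cond_wait(&c->cond, &c->lock);
	c->done--;
	pthread_mutex_unlock(&c->lock);
}

/* Post one completion and wake one waiter. */
void completion_sketch_complete(struct completion_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	c->done++;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Stands in for the skb destructor that marks the frame available. */
void *tx_done_thread(void *arg)
{
	completion_sketch_complete(arg);
	return NULL;
}
```

A sender would start tx_done_thread with pthread_create(), passing the
completion as its argument, and then call completion_sketch_wait(). The
point is only that the wait is a true sleep keyed to the freeing of the
skb, rather than a schedule() call that a SCHED_FIFO task returns from
immediately.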


Thread overview: 34+ messages
2019-06-19 20:25 [PATCH net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET Neil Horman
2019-06-20 13:41 ` Willem de Bruijn
2019-06-20 14:01   ` Matteo Croce
2019-06-20 14:23   ` Neil Horman
2019-06-20 15:16     ` Willem de Bruijn
2019-06-20 16:14       ` Neil Horman
2019-06-20 16:18         ` Willem de Bruijn [this message]
2019-06-20 17:31           ` Neil Horman
2019-06-21 16:41       ` Neil Horman
2019-06-21 18:31         ` Willem de Bruijn
2019-06-21 19:18           ` Neil Horman
2019-06-21 20:06             ` Willem de Bruijn
2019-06-22 11:08               ` Neil Horman
2019-06-22 17:41 ` [PATCH v2 " Neil Horman
2019-06-23  2:12   ` Willem de Bruijn
2019-06-23  2:21     ` Willem de Bruijn
2019-06-23 11:40       ` Neil Horman
2019-06-23 14:39         ` Willem de Bruijn
2019-06-23 19:21           ` Neil Horman
2019-06-23 11:34     ` Neil Horman
2019-06-24  0:46 ` [PATCH v3 " Neil Horman
2019-06-24 18:08   ` Willem de Bruijn
2019-06-24 21:51     ` Neil Horman
2019-06-24 22:15       ` Willem de Bruijn
2019-06-25 11:02         ` Neil Horman
2019-06-25 13:37           ` Willem de Bruijn
2019-06-25 16:20             ` Neil Horman
2019-06-25 21:59               ` Willem de Bruijn
2019-06-25 21:57 ` [PATCH v4 " Neil Horman
2019-06-25 22:30   ` Willem de Bruijn
2019-06-26 10:54     ` Neil Horman
2019-06-26 15:05       ` Willem de Bruijn
2019-06-26 17:14         ` Neil Horman
2019-06-27  2:38   ` David Miller
