From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alexei Starovoitov Subject: Re: [PATCH net-next 1/3] net/sock: factor out dequeue/peek with offset code Date: Wed, 24 Oct 2018 14:23:13 -0700 Message-ID: <20181024212219.je7vtvpjovlbdrqa@ast-mbp> References: <503e324ecd4085f256474df1a352a92814fd29f4.1494837879.git.pabeni@redhat.com> <20181023044929.guyx7uwf5ndt6hiz@ast-mbp> <4c94ee8fe77a51d61927bfff46441abc15172193.camel@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: netdev@vger.kernel.org, "David S. Miller" , Eric Dumazet , kafai@fb.com, daniel@iogearbox.net To: Paolo Abeni Return-path: Received: from mail-pl1-f193.google.com ([209.85.214.193]:41264 "EHLO mail-pl1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725829AbeJYFw6 (ORCPT ); Thu, 25 Oct 2018 01:52:58 -0400 Received: by mail-pl1-f193.google.com with SMTP id p5-v6so2821566plq.8 for ; Wed, 24 Oct 2018 14:23:16 -0700 (PDT) Content-Disposition: inline In-Reply-To: <4c94ee8fe77a51d61927bfff46441abc15172193.camel@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Tue, Oct 23, 2018 at 09:28:03AM +0200, Paolo Abeni wrote: > Hi, > > On Mon, 2018-10-22 at 21:49 -0700, Alexei Starovoitov wrote: > > On Mon, May 15, 2017 at 11:01:42AM +0200, Paolo Abeni wrote: > > > And update __sk_queue_drop_skb() to work on the specified queue. > > > This will help the udp protocol to use an additional private > > > rx queue in a later patch. > > > > > > Signed-off-by: Paolo Abeni > > > --- > > > include/linux/skbuff.h | 7 ++++ > > > include/net/sock.h | 4 +-- > > > net/core/datagram.c | 90 ++++++++++++++++++++++++++++---------------------- > > > 3 files changed, 60 insertions(+), 41 deletions(-) > > > > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > > > index a098d95..bfc7892 100644 > > > --- a/include/linux/skbuff.h > > > +++ b/include/linux/skbuff.h > > > @@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff *skb) > > > > > > int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p, > > > const struct sk_buff *skb); > > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk, > > > + struct sk_buff_head *queue, > > > + unsigned int flags, > > > + void (*destructor)(struct sock *sk, > > > + struct sk_buff *skb), > > > + int *peeked, int *off, int *err, > > > + struct sk_buff **last); > > > struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags, > > > void (*destructor)(struct sock *sk, > > > struct sk_buff *skb), > > > diff --git a/include/net/sock.h b/include/net/sock.h > > > index 66349e4..49d226f 100644 > > > --- a/include/net/sock.h > > > +++ b/include/net/sock.h > > > @@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer, > > > > > > void sk_stop_timer(struct sock *sk, struct timer_list *timer); > > > > > > -int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb, > > > - unsigned int flags, > > > +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue, > > > + struct sk_buff *skb, unsigned int flags, > > > void (*destructor)(struct sock *sk, > > > struct sk_buff *skb)); > > > int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb); > > > diff --git a/net/core/datagram.c b/net/core/datagram.c > > > index db1866f2..a4592b4 100644 > > > --- a/net/core/datagram.c > > > +++ b/net/core/datagram.c > > > @@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb) > > > return skb; > > > } > > > > > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk, > > > + struct sk_buff_head *queue, > > > + unsigned int flags, > > > + void (*destructor)(struct sock *sk, > > > + struct sk_buff *skb), > > > + int *peeked, int *off, int *err, > > > + struct sk_buff **last) > > > +{ > > > + struct sk_buff *skb; > > > + > > > + *last = queue->prev; > > > > this refactoring changed the behavior. > > Now queue->prev is returned as last. > > Whereas it was *last = queue before. > > > > > + skb_queue_walk(queue, skb) { > > > > and *last = skb assignment is gone too. > > > > Was this intentional ? > > Yes. > > > Is this the right behavior? > > I think so. queue->prev is the last skb in the queue. With the old > code, __skb_try_recv_datagram(), when returning NULL, used the > instructions you quoted to overall set 'last' to the last skb in the > queue. We did not use 'last' elsewhere. So overall this just reduce the > number of instructions inside the loop. (unless I'm missing something). Right. On the second glance it does appear to be correct. > Are you experiencing any specific issues due to the mentioned commit? yes. Just like what Baoyou Xie reported https://lore.kernel.org/patchwork/patch/962802/ we're hitting infinite loop in __skb_recv_datagram() on 4.11 kernel. and different, but also buggy, behavior on the net-next. In particular __skb_try_recv_datagram() returns immediately, because skb_queue_empty() is true (sk->sk_receive_queue.next == &sk->sk_receive_queue) But __skb_wait_for_more_packets() also returns immediately because if (sk->sk_receive_queue.prev != skb) is also true. There is a link list corruption in sk_receive_queue. list->next == list, but list->prev still points to valid skb. Before your commit we had *last = queue; and we had this infinite loop I described above. After your commit *last = queue->next; which assigns buggy pointer into *last, but that accidentally makes if (sk->sk_receive_queue.prev != skb) to be false and __skb_wait_for_more_packets() goes into schedule_timeout(). Eventually bad things happen too, but in the different spot. The corruption is somehow related to netlink_recvmsg() just like in that Baoyou Xie report. The typical stack trace is __skb_wait_for_more_packets+0x64/0x140 ? skb_gro_receive+0x310/0x310 __skb_recv_datagram+0x5c/0xa0 skb_recv_datagram+0x31/0x40 netlink_recvmsg+0x51/0x3c0 ? sock_write_iter+0xf8/0x110 SYSC_recvfrom+0x116/0x190 We didn't figure out a way to reproduce it yet. kasan didn't help. The way netlink socket pushes skbs into sk_receive_queue and drains it all looks correct. We thought it could be MSG_PEAK related, but skb->users refcnting also looks correct. If anyone have any ideas what things to try, I'm all ears.