Re: [PATCH net-next 1/3] net/sock: factor out dequeue/peek with offset code

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	kafai@fb.com, daniel@iogearbox.net
Subject: Re: [PATCH net-next 1/3] net/sock: factor out dequeue/peek with offset code
Date: Wed, 24 Oct 2018 14:23:13 -0700	[thread overview]
Message-ID: <20181024212219.je7vtvpjovlbdrqa@ast-mbp> (raw)
In-Reply-To: <4c94ee8fe77a51d61927bfff46441abc15172193.camel@redhat.com>

On Tue, Oct 23, 2018 at 09:28:03AM +0200, Paolo Abeni wrote:
> Hi,
> 
> On Mon, 2018-10-22 at 21:49 -0700, Alexei Starovoitov wrote:
> > On Mon, May 15, 2017 at 11:01:42AM +0200, Paolo Abeni wrote:
> > > And update __sk_queue_drop_skb() to work on the specified queue.
> > > This will help the udp protocol to use an additional private
> > > rx queue in a later patch.
> > > 
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  include/linux/skbuff.h |  7 ++++
> > >  include/net/sock.h     |  4 +--
> > >  net/core/datagram.c    | 90 ++++++++++++++++++++++++++++----------------------
> > >  3 files changed, 60 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index a098d95..bfc7892 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
> > >  
> > >  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
> > >  				const struct sk_buff *skb);
> > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > > +					  struct sk_buff_head *queue,
> > > +					  unsigned int flags,
> > > +					  void (*destructor)(struct sock *sk,
> > > +							   struct sk_buff *skb),
> > > +					  int *peeked, int *off, int *err,
> > > +					  struct sk_buff **last);
> > >  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
> > >  					void (*destructor)(struct sock *sk,
> > >  							   struct sk_buff *skb),
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 66349e4..49d226f 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
> > >  
> > >  void sk_stop_timer(struct sock *sk, struct timer_list *timer);
> > >  
> > > -int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
> > > -			unsigned int flags,
> > > +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
> > > +			struct sk_buff *skb, unsigned int flags,
> > >  			void (*destructor)(struct sock *sk,
> > >  					   struct sk_buff *skb));
> > >  int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
> > > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > > index db1866f2..a4592b4 100644
> > > --- a/net/core/datagram.c
> > > +++ b/net/core/datagram.c
> > > @@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
> > >  	return skb;
> > >  }
> > >  
> > > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > > +					  struct sk_buff_head *queue,
> > > +					  unsigned int flags,
> > > +					  void (*destructor)(struct sock *sk,
> > > +							   struct sk_buff *skb),
> > > +					  int *peeked, int *off, int *err,
> > > +					  struct sk_buff **last)
> > > +{
> > > +	struct sk_buff *skb;
> > > +
> > > +	*last = queue->prev;
> > 
> > this refactoring changed the behavior.
> > Now queue->prev is returned as last.
> > Whereas it was *last = queue before.
> > 
> > > +	skb_queue_walk(queue, skb) {
> > 
> > and *last = skb assignment is gone too.
> > 
> > Was this intentional ? 
> 
> Yes.
> 
> > Is this the right behavior?
> 
> I think so. queue->prev is the last skb in the queue. With the old
> code,   __skb_try_recv_datagram(), when returning NULL, used the
> instructions you quoted to overall set 'last' to the last skb in the
> queue. We did not use 'last' elsewhere. So overall this just reduce the
> number of instructions inside the loop. (unless I'm missing something).

Right. On the second glance it does appear to be correct.

> Are you experiencing any specific issues due to the mentioned commit?

yes.

Just like what Baoyou Xie reported https://lore.kernel.org/patchwork/patch/962802/
we're hitting infinite loop in __skb_recv_datagram() on 4.11 kernel.
and different, but also buggy, behavior on the net-next.

In particular __skb_try_recv_datagram() returns immediately,
because skb_queue_empty() is true (sk->sk_receive_queue.next == &sk->sk_receive_queue)

But __skb_wait_for_more_packets() also returns immediately
because if (sk->sk_receive_queue.prev != skb) is also true.

There is a link list corruption in sk_receive_queue.

list->next == list, but list->prev still points to valid skb.
Before your commit we had
*last = queue;
and we had this infinite loop I described above.
After your commit
*last = queue->next;
which assigns buggy pointer into *last, but that accidentally
makes if (sk->sk_receive_queue.prev != skb) to be false
and __skb_wait_for_more_packets() goes into schedule_timeout().
Eventually bad things happen too, but in the different spot.

The corruption is somehow related to netlink_recvmsg() just like in that
Baoyou Xie report.

The typical stack trace is
__skb_wait_for_more_packets+0x64/0x140
? skb_gro_receive+0x310/0x310
__skb_recv_datagram+0x5c/0xa0
skb_recv_datagram+0x31/0x40
netlink_recvmsg+0x51/0x3c0
? sock_write_iter+0xf8/0x110
SYSC_recvfrom+0x116/0x190

We didn't figure out a way to reproduce it yet.
kasan didn't help.
The way netlink socket pushes skbs into sk_receive_queue and drains it
all looks correct. We thought it could be MSG_PEAK related, but
skb->users refcnting also looks correct.

If anyone have any ideas what things to try, I'm all ears.