Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
       [not found] <20200414040717.22040-1-hdanton@sina.com>
@ 2020-04-14  4:31 ` Jens Axboe
  0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2020-04-14  4:31 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Qian Cai, LKML, Linux-MM


> On Apr 13, 2020, at 10:07 PM, Hillf Danton <hdanton@sina.com> wrote:
> 
> 
>> On Mon, 13 Apr 2020 18:06:21 -0400 Qian Cai wrote:
>> 
>> BTW, I=E2=80=99ll be adding fuzzers to my daily linux-next routines =
>> where it triggers this
>> io_uring/scheduler bug almost immediately, so hopefully would buy syzbot =
>> some
>> time to resume on linux-next.
>> 
>> [67493.516737][T211750] BUG: unable to handle page fault for address: =
>> ffffffffffffffe8
>> [67493.557315][T211750] #PF: supervisor read access in kernel mode
>> [67493.586726][T211750] #PF: error_code(0x0000) - not-present page
>> [67493.614434][T211750] PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0=20=
>> 
>> [67493.644846][T211750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
>> [67493.674127][T211750] CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: =
>> G    B        L    5.7.0-rc1-next-20200413 #4
>> [67493.722516][T211750] Hardware name: HP ProLiant DL380 Gen9/ProLiant =
>> DL380 Gen9, BIOS P89 04/12/2017
>> [67493.764925][T211750] RIP: 0010:__wake_up_common+0x98/0x290
>> __wake_up_common at kernel/sched/wait.c:87
>> [67493.790675][T211750] Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 =
>> 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 =
>> 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 =
>> 8b 47 10
>> [67493.881650][T211750] RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
>> [67493.909854][T211750] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: =
>> ffffffffaa9636b8
>> [67493.947131][T211750] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: =
>> ffffffffffffffe8
>> [67493.983829][T211750] RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: =
>> fffffbfff582c5fd
>> [67494.020861][T211750] R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: =
>> 0000000000000000
>> [67494.059249][T211750] R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: =
>> ffffffffffffffe8
>> [67494.099699][T211750] FS:  00007fdcba4c4740(0000) =
>> GS:ffff889033780000(0000) knlGS:0000000000000000
>> [67494.141858][T211750] CS:  0010 DS: 0000 ES: 0000 CR0: =
>> 0000000080050033
>> [67494.172660][T211750] CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: =
>> 00000000001606e0
>> [67494.209760][T211750] Call Trace:
>> [67494.224720][T211750]  __wake_up_common_lock+0xea/0x150
>> (inlined by) __wake_up_common_lock at kernel/sched/wait.c:124
>> [67494.248753][T211750]  ? __wake_up_common+0x290/0x290
>> [67494.272014][T211750]  ? lockdep_hardirqs_on+0x16/0x2c0
>> [67494.296139][T211750]  __wake_up+0x13/0x20
>> [67494.314946][T211750]  io_cqring_ev_posted+0x75/0xe0
>> (inlined by) io_cqring_ev_posted at fs/io_uring.c:1160
>> [67494.337726][T211750]  io_ring_ctx_wait_and_kill+0x1c0/0x2f0
>> io_ring_ctx_wait_and_kill at fs/io_uring.c:7305
>> [67494.363840][T211750]  io_uring_create+0xa8d/0x13b0
>> [67494.386526][T211750]  ? io_req_defer_prep+0x990/0x990
>> [67494.410119][T211750]  ? __kasan_check_write+0x14/0x20
>> [67494.433646][T211750]  io_uring_setup+0xb8/0x130
>> [67494.454870][T211750]  ? io_uring_create+0x13b0/0x13b0
>> [67494.478342][T211750]  ? check_flags.part.28+0x220/0x220
>> [67494.502947][T211750]  ? lockdep_hardirqs_on+0x16/0x2c0
>> [67494.526965][T211750]  __x64_sys_io_uring_setup+0x31/0x40
>> [67494.551820][T211750]  do_syscall_64+0xcc/0xaf0
>> [67494.574829][T211750]  ? syscall_return_slowpath+0x580/0x580
>> [67494.604591][T211750]  ? lockdep_hardirqs_off+0x1f/0x140
>> [67494.628901][T211750]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
>> [67494.657616][T211750]  ? trace_hardirqs_off_caller+0x3a/0x150
>> [67494.683999][T211750]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>> [67494.709982][T211750]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> 
> See if it makes your fuzzers happy.

It will, and we should probably do that separately too. I already posted another fix that avoids the posted call that triggers it.

— 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 0/2] mm: Two small fixes for recent syzbot reports
@ 2020-04-08  1:40 Peter Xu
  2020-04-09  0:47 ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2020-04-08  1:40 UTC (permalink / raw)
  To: linux-kernel, Linus Torvalds, linux-mm; +Cc: Andrew Morton, peterx

The two patches should fix below syzbot reports:

  BUG: unable to handle kernel paging request in kernel_get_mempolicy
  https://lore.kernel.org/lkml/0000000000002b25f105a2a3434d@google.com/

  WARNING: bad unlock balance in __get_user_pages_remote
  https://lore.kernel.org/lkml/00000000000005c65d05a2b90e70@google.com/

Note that the 1st patch also applied two extra small changes comparing
to when posted on the list in that: (1) it squashed an "interupt"
spelling error that Andrew has pointed out when picked up, and (2) it
also initializes the "page" pointer to NULL.  But I'm fairly confident
it shouldn't affect the correctness of the patch.

The 2nd patch is exactly the patch posted previously.

Thanks,

Peter Xu (2):
  mm/mempolicy: Allow lookup_node() to handle fatal signal
  mm/gup: Mark lock taken only after a successful retake

 mm/gup.c       | 2 +-
 mm/mempolicy.c | 7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

-- 
2.24.1

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-08  1:40 Peter Xu
@ 2020-04-09  0:47 ` Andrew Morton
  2020-04-09 11:49   ` Matthew Wilcox
  2020-04-09 12:55   ` Dmitry Vyukov
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2020-04-09  0:47 UTC (permalink / raw)
  To: Peter Xu; +Cc: linux-kernel, Linus Torvalds, linux-mm

On Tue,  7 Apr 2020 21:40:08 -0400 Peter Xu <peterx@redhat.com> wrote:

> The two patches should fix below syzbot reports:
> 
>   BUG: unable to handle kernel paging request in kernel_get_mempolicy
>   https://lore.kernel.org/lkml/0000000000002b25f105a2a3434d@google.com/
> 
>   WARNING: bad unlock balance in __get_user_pages_remote
>   https://lore.kernel.org/lkml/00000000000005c65d05a2b90e70@google.com/

(Is there an email address for the syzbot operators?)

sysbot does test linux-next, yet these patches sat in linux-next for a
month without a peep, but all hell broke loose when they hit Linus's
tree.  How could this have happened?

Possibly I've been carrying a later patch which fixed all this up, but
I'm not seeing anything like that.  Nothing at all against mm/gup.c.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09  0:47 ` Andrew Morton
@ 2020-04-09 11:49   ` Matthew Wilcox
  2020-04-09 13:00     ` Dmitry Vyukov
  2020-04-09 12:55   ` Dmitry Vyukov
  1 sibling, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2020-04-09 11:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, linux-kernel, Linus Torvalds, linux-mm, syzkaller-bugs

On Wed, Apr 08, 2020 at 05:47:32PM -0700, Andrew Morton wrote:
> On Tue,  7 Apr 2020 21:40:08 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > The two patches should fix below syzbot reports:
> > 
> >   BUG: unable to handle kernel paging request in kernel_get_mempolicy
> >   https://lore.kernel.org/lkml/0000000000002b25f105a2a3434d@google.com/
> > 
> >   WARNING: bad unlock balance in __get_user_pages_remote
> >   https://lore.kernel.org/lkml/00000000000005c65d05a2b90e70@google.com/
> 
> (Is there an email address for the syzbot operators?)

I'd suggest syzkaller-bugs@googlegroups.com (added to the Cc).

But there's a deeper problem in that we don't have anywhere to stash
that kind of information in the kernel tree right now.  Perhaps a special
entry in the MAINTAINERS file for bot operators?  Or one entry per bot?

> sysbot does test linux-next, yet these patches sat in linux-next for a
> month without a peep, but all hell broke loose when they hit Linus's
> tree.  How could this have happened?
> 
> Possibly I've been carrying a later patch which fixed all this up, but
> I'm not seeing anything like that.  Nothing at all against mm/gup.c.
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 11:49   ` Matthew Wilcox
@ 2020-04-09 13:00     ` Dmitry Vyukov
  2020-04-09 18:16       ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Dmitry Vyukov @ 2020-04-09 13:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Peter Xu, LKML, Linus Torvalds, Linux-MM,
	syzkaller-bugs, syzkaller

On Thu, Apr 9, 2020 at 1:49 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Apr 08, 2020 at 05:47:32PM -0700, Andrew Morton wrote:
> > On Tue,  7 Apr 2020 21:40:08 -0400 Peter Xu <peterx@redhat.com> wrote:
> >
> > > The two patches should fix below syzbot reports:
> > >
> > >   BUG: unable to handle kernel paging request in kernel_get_mempolicy
> > >   https://lore.kernel.org/lkml/0000000000002b25f105a2a3434d@google.com/
> > >
> > >   WARNING: bad unlock balance in __get_user_pages_remote
> > >   https://lore.kernel.org/lkml/00000000000005c65d05a2b90e70@google.com/
> >
> > (Is there an email address for the syzbot operators?)
>
> I'd suggest syzkaller-bugs@googlegroups.com (added to the Cc).

syzkaller@googlegroups.com is a better one.
syzkaller-bugs@googlegroups.com plays more of an LKML role.

> But there's a deeper problem in that we don't have anywhere to stash
> that kind of information in the kernel tree right now.  Perhaps a special
> entry in the MAINTAINERS file for bot operators?  Or one entry per bot?

I don't mind adding syzkaller. Some time ago I wanted to contact
KernelCI, CKI, LKFT, 0-day owners, finding relevant lists wasn't
impossible, but for some it was hard.

For syzkaller it would be:

https://github.com/google/syzkaller/issues for bugs/feature requests.
syzkaller@googlegroups.com for discussions.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 13:00     ` Dmitry Vyukov
@ 2020-04-09 18:16       ` Andrew Morton
  2020-04-09 18:53         ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2020-04-09 18:16 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Matthew Wilcox, Peter Xu, LKML, Linus Torvalds, Linux-MM,
	syzkaller-bugs, syzkaller

On Thu, 9 Apr 2020 15:00:20 +0200 Dmitry Vyukov <dvyukov@google.com> wrote:

> On Thu, Apr 9, 2020 at 1:49 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Apr 08, 2020 at 05:47:32PM -0700, Andrew Morton wrote:
> > > On Tue,  7 Apr 2020 21:40:08 -0400 Peter Xu <peterx@redhat.com> wrote:
> > >
> > > > The two patches should fix below syzbot reports:
> > > >
> > > >   BUG: unable to handle kernel paging request in kernel_get_mempolicy
> > > >   https://lore.kernel.org/lkml/0000000000002b25f105a2a3434d@google.com/
> > > >
> > > >   WARNING: bad unlock balance in __get_user_pages_remote
> > > >   https://lore.kernel.org/lkml/00000000000005c65d05a2b90e70@google.com/
> > >
> > > (Is there an email address for the syzbot operators?)
> >
> > I'd suggest syzkaller-bugs@googlegroups.com (added to the Cc).
> 
> syzkaller@googlegroups.com is a better one.
> syzkaller-bugs@googlegroups.com plays more of an LKML role.
> 
> > But there's a deeper problem in that we don't have anywhere to stash
> > that kind of information in the kernel tree right now.  Perhaps a special
> > entry in the MAINTAINERS file for bot operators?  Or one entry per bot?
> 
> I don't mind adding syzkaller. Some time ago I wanted to contact
> KernelCI, CKI, LKFT, 0-day owners, finding relevant lists wasn't
> impossible, but for some it was hard.
> 
> For syzkaller it would be:
> 
> https://github.com/google/syzkaller/issues for bugs/feature requests.
> syzkaller@googlegroups.com for discussions.

OK, thanks.  A MAINTAINERS entry would be great.

Could I please direct attention back to my original question regarding
the problems we've recently discovered in 4426e945df58 ("mm/gup: allow
VM_FAULT_RETRY for multiple times") and 71335f37c5e8 ("mm/gup: allow to
react to fatal signals")?

> sysbot does test linux-next, yet these patches sat in linux-next for a
> month without a peep, but all hell broke loose when they hit Linus's
> tree.  How could this have happened?
> 
> Possibly I've been carrying a later patch which fixed all this up, but
> I'm not seeing anything like that.  Nothing at all against mm/gup.c.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 18:16       ` Andrew Morton
@ 2020-04-09 18:53         ` Linus Torvalds
  2020-04-09 19:12           ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2020-04-09 18:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitry Vyukov, Matthew Wilcox, Peter Xu, LKML, Linux-MM,
	syzkaller-bugs, syzkaller

On Thu, Apr 9, 2020 at 11:16 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Could I please direct attention back to my original question regarding
> the problems we've recently discovered in 4426e945df58 ("mm/gup: allow
> VM_FAULT_RETRY for multiple times") and 71335f37c5e8 ("mm/gup: allow to
> react to fatal signals")?

What earlier question? The "how could this happen" one?

Dmitry already answered that one - are you perhaps missing the emails?

linux-next has apparently not worked at all for over a month. So it
got no testing at all, and thus also all the gup patches got no
testing in linux-next.

Only when they hit my tree, did they start getting testing. Not
because my tree is the only thing getting tested, but because my tree
is the only tree that _works_.

           Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 18:53         ` Linus Torvalds
@ 2020-04-09 19:12           ` Andrew Morton
  2020-04-09 19:46             ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2020-04-09 19:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Vyukov, Matthew Wilcox, Peter Xu, LKML, Linux-MM,
	syzkaller-bugs, syzkaller

On Thu, 9 Apr 2020 11:53:33 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Apr 9, 2020 at 11:16 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Could I please direct attention back to my original question regarding
> > the problems we've recently discovered in 4426e945df58 ("mm/gup: allow
> > VM_FAULT_RETRY for multiple times") and 71335f37c5e8 ("mm/gup: allow to
> > react to fatal signals")?
> 
> What earlier question? The "how could this happen" one?
> 
> Dmitry already answered that one - are you perhaps missing the emails?

Yup, email threading got broken.

> linux-next has apparently not worked at all for over a month. So it
> got no testing at all, and thus also all the gup patches got no
> testing in linux-next.
> 
> Only when they hit my tree, did they start getting testing. Not
> because my tree is the only thing getting tested, but because my tree
> is the only tree that _works_.
> 

And now the challenge is to protect your tree from the bad patches.

https://groups.google.com/forum/#!msg/syzkaller-bugs/phowYdNXHck/qU1P0TsjBAAJ
points at

net/openvswitch/conntrack.c
net/bluetooth/l2cap_sock.c
sound/core/oss/pcm_plugin.c

and other things, but it's 2+ weeks old.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 19:12           ` Andrew Morton
@ 2020-04-09 19:46             ` Linus Torvalds
  2020-04-09 19:56               ` Matthew Wilcox
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2020-04-09 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitry Vyukov, Matthew Wilcox, Peter Xu, LKML, Linux-MM,
	syzkaller-bugs, syzkaller

On Thu, Apr 9, 2020 at 12:12 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> And now the challenge is to protect your tree from the bad patches.

Well, right now, yes.

But in the longer term, I think we want to protect linux-next from the
bad patches so that they don't poison the testing that the bots can
do.

So that's why I suggested that linux-next and syzbot have some
protocol to have things that cause syzbot pain to be removed from
linux-next more aggressively.

              Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 19:46             ` Linus Torvalds
@ 2020-04-09 19:56               ` Matthew Wilcox
  2020-04-09 19:58                 ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2020-04-09 19:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Dmitry Vyukov, Peter Xu, LKML, Linux-MM,
	syzkaller-bugs, Stephen Rothwell

On Thu, Apr 09, 2020 at 12:46:08PM -0700, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 12:12 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > And now the challenge is to protect your tree from the bad patches.
> 
> Well, right now, yes.
> 
> But in the longer term, I think we want to protect linux-next from the
> bad patches so that they don't poison the testing that the bots can
> do.
> 
> So that's why I suggested that linux-next and syzbot have some
> protocol to have things that cause syzbot pain to be removed from
> linux-next more aggressively.

We should probably give Stephen a cc here ...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 19:56               ` Matthew Wilcox
@ 2020-04-09 19:58                 ` Linus Torvalds
  2020-04-09 20:27                   ` Eric Biggers
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2020-04-09 19:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Dmitry Vyukov, Peter Xu, LKML, Linux-MM,
	syzkaller-bugs, Stephen Rothwell

On Thu, Apr 9, 2020 at 12:56 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> We should probably give Stephen a cc here ...

Heh. I already did, but then that got broken because Andrew had lost
that part of the thread and the discussion re-started.

So Stephen was already cc'd for my original request to have linux-next
kick things out aggressively.

            Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 19:58                 ` Linus Torvalds
@ 2020-04-09 20:27                   ` Eric Biggers
  2020-04-09 20:34                     ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Biggers @ 2020-04-09 20:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andrew Morton, Dmitry Vyukov, Peter Xu, LKML,
	Linux-MM, syzkaller-bugs, Stephen Rothwell

On Thu, Apr 09, 2020 at 12:58:48PM -0700, Linus Torvalds wrote:
> On Thu, Apr 9, 2020 at 12:56 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > We should probably give Stephen a cc here ...
> 
> Heh. I already did, but then that got broken because Andrew had lost
> that part of the thread and the discussion re-started.
> 
> So Stephen was already cc'd for my original request to have linux-next
> kick things out aggressively.
> 

Well, if (for example) we look at
"linux-next test error: WARNING: suspicious RCU usage in ovs_ct_exit"
(https://lkml.kernel.org/lkml/000000000000e642a905a0cbee6e@google.com/),
it was sent to the maintainers of net/openvswitch/ where the warning occurred.
It was then ignored.

Would it help if bugs blocking testing on linux-next were Cc'ed to
linux-next@vger.kernel.org, so that Stephen could investigate?

FWIW, the issue of "syzbot report sent and ignored for months/years" is actually
a much broader one which applies to all bugs, not just build / test breakages.
There are tons of open bugs on https://syzkaller.appspot.com/upstream which are
definitely still valid (sort by "Last" occurred).  Long-term, to fix this we
really need syzbot to start sending reminders.  But first there's work needed to
make the noise level low enough so that people don't just tune them out.

- Eric

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 20:27                   ` Eric Biggers
@ 2020-04-09 20:34                     ` Linus Torvalds
  2020-04-09 23:34                       ` Stephen Rothwell
  2020-04-10  1:11                       ` Theodore Y. Ts'o
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2020-04-09 20:34 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Matthew Wilcox, Andrew Morton, Dmitry Vyukov, Peter Xu, LKML,
	Linux-MM, syzkaller-bugs, Stephen Rothwell

On Thu, Apr 9, 2020 at 1:27 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> Would it help if bugs blocking testing on linux-next were Cc'ed to
> linux-next@vger.kernel.org, so that Stephen could investigate?

Maybe. I'll let Stephen say.

But I think the big issue is the "blocking testing" part.

If it's "just" regular bugs, then:

> FWIW, the issue of "syzbot report sent and ignored for months/years" is actually
> a much broader one which applies to all bugs, not just build / test breakages.

I don't  know what to do about that, but it may be that people just
don't judge the bugs interesting or assume that they are old.

That's what made bugzilla so useless - being flooded with stale bugs
that might not be worth worrying about, and no way to really tell.

So old bugs generally should be aged out, and then if they still
happen, prioritized. With "this keeps us from even finding new bugs"
being a fairly high priority..

One de-motivational issue with syzbot reported bugs may be that they
sometimes get sent to the wrong set of people - but still wide enough
that everybody feels it's somebody elses issue. A kind of bystander
effect for bugs.

                Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 20:34                     ` Linus Torvalds
@ 2020-04-09 23:34                       ` Stephen Rothwell
  2020-04-10  1:11                       ` Theodore Y. Ts'o
  1 sibling, 0 replies; 37+ messages in thread
From: Stephen Rothwell @ 2020-04-09 23:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Biggers, Matthew Wilcox, Andrew Morton, Dmitry Vyukov,
	Peter Xu, LKML, Linux-MM, syzkaller-bugs

[-- Attachment #1: Type: text/plain, Size: 714 bytes --]

Hi all,

On Thu, 9 Apr 2020 13:34:18 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, Apr 9, 2020 at 1:27 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > Would it help if bugs blocking testing on linux-next were Cc'ed to
> > linux-next@vger.kernel.org, so that Stephen could investigate?  
> 
> Maybe. I'll let Stephen say.

It would certainly help so I could at least chase people and maybe
revert commits.  Dropping trees can be problematic once Andrew has
built his quilt series on top of them, so I try not to drop whole trees.
I can also use a previous version of a tree (which is usually what I
do if I discover a build problem).

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 20:34                     ` Linus Torvalds
  2020-04-09 23:34                       ` Stephen Rothwell
@ 2020-04-10  1:11                       ` Theodore Y. Ts'o
  1 sibling, 0 replies; 37+ messages in thread
From: Theodore Y. Ts'o @ 2020-04-10  1:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Biggers, Matthew Wilcox, Andrew Morton, Dmitry Vyukov,
	Peter Xu, LKML, Linux-MM, syzkaller-bugs, Stephen Rothwell

On Thu, Apr 09, 2020 at 01:34:18PM -0700, Linus Torvalds wrote:
> > FWIW, the issue of "syzbot report sent and ignored for months/years" is actually
> > a much broader one which applies to all bugs, not just build / test breakages.
> 
> I don't  know what to do about that, but it may be that people just
> don't judge the bugs interesting or assume that they are old.

Syzkaller bugs which requuire (a) root privileges to trigger, or (b)
require a deliberately corrupted file system are things which I don't
consider super interesting.  (For the latter, I'll usually wait for
some other file system fuzzer to find it, such as Hydra, because
Syzkaller makes it painful extract out the file system image, where as
other file system fuzzers are *much* more file system developer
friendly.)

This shouldn't be a surprise to Dmitry, because I've given these
feedbacks to him before.

It would be nice if there was some way we could triage Syzkaller bugs
into different buckets (requires root, lower to P2; requires a
corrupted file system image, lower to P2).  Unfortunately, that would
require Syzkaller to have some kind of login system and way to track
state, and Dmitry doesn't want to replicate the functionality of a bug
tracker.

> That's what made bugzilla so useless - being flooded with stale bugs
> that might not be worth worrying about, and no way to really tell.

At least with Bugzilla, it becomes possible to attach priorities and
flags to them, instead of trying to assume that developers should
treat all Syzkaller bugs as the same priority.  Because when you do
insist that all bugs be treated as high priority, many people will
just treat them *all* as a P2 bug, especially when there are so many.

      	   	       	    	       	     	- Ted

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09  0:47 ` Andrew Morton
  2020-04-09 11:49   ` Matthew Wilcox
@ 2020-04-09 12:55   ` Dmitry Vyukov
  2020-04-09 16:32     ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Dmitry Vyukov @ 2020-04-09 12:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Xu, LKML, Linus Torvalds, Linux-MM

On Thu, Apr 9, 2020 at 2:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue,  7 Apr 2020 21:40:08 -0400 Peter Xu <peterx@redhat.com> wrote:
>
> > The two patches should fix below syzbot reports:
> >
> >   BUG: unable to handle kernel paging request in kernel_get_mempolicy
> >   https://lore.kernel.org/lkml/0000000000002b25f105a2a3434d@google.com/
> >
> >   WARNING: bad unlock balance in __get_user_pages_remote
> >   https://lore.kernel.org/lkml/00000000000005c65d05a2b90e70@google.com/
>
> (Is there an email address for the syzbot operators?)
>
> sysbot does test linux-next, yet these patches sat in linux-next for a
> month without a peep, but all hell broke loose when they hit Linus's
> tree.  How could this have happened?

The same thing:
https://groups.google.com/d/msg/syzkaller-bugs/phowYdNXHck/qU1P0TsjBAAJ

linux-next is boot-broken for more than a month and bugs are piling
onto bugs, I've seen at least 3 different ones.
syzbot can't get any working linux-next build for testing for a very
long time now.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 12:55   ` Dmitry Vyukov
@ 2020-04-09 16:32     ` Linus Torvalds
  2020-04-09 16:58       ` Qian Cai
  2020-04-09 23:29       ` Stephen Rothwell
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2020-04-09 16:32 UTC (permalink / raw)
  To: Dmitry Vyukov, Stephen Rothwell; +Cc: Andrew Morton, Peter Xu, LKML, Linux-MM

On Thu, Apr 9, 2020 at 5:55 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> linux-next is boot-broken for more than a month and bugs are piling
> onto bugs, I've seen at least 3 different ones.
> syzbot can't get any working linux-next build for testing for a very
> long time now.

Ouch.

Ok, that's not good. It means that linux-next has basically only done
build-testing this whole cycle.

Stephen, Dmitry - is there some way linux-next could possibly kick out
trees more aggressively if syzbot can't even boot?

This merge window has seemed otherwise fairly smooth to me (famous
last words), and it's really sad how the nice page fault cleanups
ended up being such an ongoing pain when the problems _could_ have
been caught earlier. We started to get syzbot reports very quickly
after they got merged into my tree, so this is clearly something that
gets exercised well - but it would have been oh-so-much better if it
had gotten noticed in linux-next.

Kicking trees out of linux-next and making noise if they cause syzbot
failures might also make some maintainers react more..

                Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 16:32     ` Linus Torvalds
@ 2020-04-09 16:58       ` Qian Cai
  2020-04-09 17:05         ` Linus Torvalds
  2020-04-09 23:29       ` Stephen Rothwell
  1 sibling, 1 reply; 37+ messages in thread
From: Qian Cai @ 2020-04-09 16:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Vyukov, Stephen Rothwell, Andrew Morton, Peter Xu, LKML, Linux-MM

> On Apr 9, 2020, at 12:32 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> Kicking trees out of linux-next and making noise if they cause syzbot
> failures might also make some maintainers react more..

On the other hand, this makes me worry who is testing on linux-next every day. The worst nightmare I am having right now is some maintainers pick up commits that only have been in -next for a few days and then push to the mainline but then it is becoming my burden to fix those commits in case they introduced regressions because it is much harder to revert patches once in mainline.

Kicking out of trees in linux-next on the other hand could make the situation worst unless we have a counter solution that make sure commits must be in -next for a certain time (a month?) before merged in mainline.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 16:58       ` Qian Cai
@ 2020-04-09 17:05         ` Linus Torvalds
  2020-04-09 17:58           ` Qian Cai
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2020-04-09 17:05 UTC (permalink / raw)
  To: Qian Cai
  Cc: Dmitry Vyukov, Stephen Rothwell, Andrew Morton, Peter Xu, LKML, Linux-MM

On Thu, Apr 9, 2020 at 9:58 AM Qian Cai <cai@lca.pw> wrote:
>
> On the other hand, this makes me worry who is testing on linux-next every day.

Well, probably not very many people outside of robots.

Which is fine, but is also why I'd like robot failures to then be a big deal.

           Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 17:05         ` Linus Torvalds
@ 2020-04-09 17:58           ` Qian Cai
  2020-04-09 18:06             ` Linus Torvalds
  0 siblings, 1 reply; 37+ messages in thread
From: Qian Cai @ 2020-04-09 17:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Vyukov, Stephen Rothwell, Andrew Morton, Peter Xu, LKML, Linux-MM

> On Apr 9, 2020, at 1:05 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> Well, probably not very many people outside of robots.
> 
> Which is fine, but is also why I'd like robot failures to then be a big deal.

Agree to make a big deal part. My point is that when kicking trees of linux-next, it also could reduce the exposure of many patches (which could be bad) to linux-next and miss valuable early testing either from robots or human. Thus, the same mistakes could happen again because maintainers could simply push those little or none linux-next exposure patches to mainline with no restrictions. There is a balance to strike.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 17:58           ` Qian Cai
@ 2020-04-09 18:06             ` Linus Torvalds
  2020-04-09 21:14               ` Qian Cai
  0 siblings, 1 reply; 37+ messages in thread
From: Linus Torvalds @ 2020-04-09 18:06 UTC (permalink / raw)
  To: Qian Cai
  Cc: Dmitry Vyukov, Stephen Rothwell, Andrew Morton, Peter Xu, LKML, Linux-MM

On Thu, Apr 9, 2020 at 10:58 AM Qian Cai <cai@lca.pw> wrote:
>
> Agree to make a big deal part. My point is that when kicking trees of linux-next, it also could reduce the exposure of many patches (which could be bad) to linux-next and miss valuable early testing either from robots or human.

Sure. But I'd want to be notified when something gets kicked out, so
that I then know not to pull it.

So it would reduce the exposure of patches, but it would also make
sure those patches then don't make it upstream.

Untested patches is fine - as long as nobody else has to suffer through them.

               Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 18:06             ` Linus Torvalds
@ 2020-04-09 21:14               ` Qian Cai
  2020-04-10 13:12                 ` Tetsuo Handa
  0 siblings, 1 reply; 37+ messages in thread
From: Qian Cai @ 2020-04-09 21:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dmitry Vyukov, Stephen Rothwell, Andrew Morton, Peter Xu, LKML, Linux-MM



> On Apr 9, 2020, at 2:06 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Thu, Apr 9, 2020 at 10:58 AM Qian Cai <cai@lca.pw> wrote:
>> 
>> Agree to make a big deal part. My point is that when kicking trees of linux-next, it also could reduce the exposure of many patches (which could be bad) to linux-next and miss valuable early testing either from robots or human.
> 
> Sure. But I'd want to be notified when something gets kicked out, so
> that I then know not to pull it.
> 
> So it would reduce the exposure of patches, but it would also make
> sure those patches then don't make it upstream.
> 
> Untested patches is fine - as long as nobody else has to suffer through them.

Excellent. It now very much depends on how Stephen will notify you when
a tree, a patchset or even a developer should be blacklisted for some time
to make this a success.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 21:14               ` Qian Cai
@ 2020-04-10 13:12                 ` Tetsuo Handa
  2020-04-10 14:26                   ` Qian Cai
  0 siblings, 1 reply; 37+ messages in thread
From: Tetsuo Handa @ 2020-04-10 13:12 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Qian Cai, Linus Torvalds, Stephen Rothwell, Andrew Morton,
	Peter Xu, LKML, Linux-MM

On 2020/04/10 6:14, Qian Cai wrote:
> 
> 
>> On Apr 9, 2020, at 2:06 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> On Thu, Apr 9, 2020 at 10:58 AM Qian Cai <cai@lca.pw> wrote:
>>>
>>> Agree to make a big deal part. My point is that when kicking trees of linux-next, it also could reduce the exposure of many patches (which could be bad) to linux-next and miss valuable early testing either from robots or human.
>>
>> Sure. But I'd want to be notified when something gets kicked out, so
>> that I then know not to pull it.
>>
>> So it would reduce the exposure of patches, but it would also make
>> sure those patches then don't make it upstream.
>>
>> Untested patches is fine - as long as nobody else has to suffer through them.
> 
> Excellent. It now very much depends on how Stephen will notify you when
> a tree, a patchset or even a developer should be blacklisted for some time
> to make this a success.
> 

Since patch flow forms tree structure, I don't know whether maintainers can
afford remembering which tree, patchset or developer should be blacklisted
when problems come from leaf git trees.



By the way...

Removing problematic trees might confuse "#syz test:" request, for
developers might ask syzbot to test proposed patches on a kernel which
does not contain problematic trees. In lucky case, test request fails
as patch failure or build failure. But in unlucky case, syzbot fails to
detect that proposed patch was tested on a kernel without problematic
trees. A bit related to https://github.com/google/syzkaller/issues/1609 .

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-10 13:12                 ` Tetsuo Handa
@ 2020-04-10 14:26                   ` Qian Cai
  2020-04-10 17:26                     ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Qian Cai @ 2020-04-10 14:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Dmitry Vyukov, Linus Torvalds, Stephen Rothwell, Andrew Morton,
	Peter Xu, LKML, Linux-MM

> On Apr 10, 2020, at 9:12 AM, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> On 2020/04/10 6:14, Qian Cai wrote:
>> 
>> 
>>> On Apr 9, 2020, at 2:06 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>> 
>>> On Thu, Apr 9, 2020 at 10:58 AM Qian Cai <cai@lca.pw> wrote:
>>>> 
>>>> Agree to make a big deal part. My point is that when kicking trees of linux-next, it also could reduce the exposure of many patches (which could be bad) to linux-next and miss valuable early testing either from robots or human.
>>> 
>>> Sure. But I'd want to be notified when something gets kicked out, so
>>> that I then know not to pull it.
>>> 
>>> So it would reduce the exposure of patches, but it would also make
>>> sure those patches then don't make it upstream.
>>> 
>>> Untested patches is fine - as long as nobody else has to suffer through them.
>> 
>> Excellent. It now very much depends on how Stephen will notify you when
>> a tree, a patchset or even a developer should be blacklisted for some time
>> to make this a success.
>> 
> 
> Since patch flow forms tree structure, I don't know whether maintainers can
> afford remembering which tree, patchset or developer should be blacklisted
> when problems come from leaf git trees.
> 
> 
> 
> By the way...
> 
> Removing problematic trees might confuse "#syz test:" request, for
> developers might ask syzbot to test proposed patches on a kernel which
> does not contain problematic trees. In lucky case, test request fails
> as patch failure or build failure. But in unlucky case, syzbot fails to
> detect that proposed patch was tested on a kernel without problematic
> trees. A bit related to https://github.com/google/syzkaller/issues/1609 .
> 

I looked at those blocking bug list sent by Dmitry. I wonder “boys, why they
did’t send those out earlier to linux-next or somewhere more visible?” because
I had dealt with most of those before, and I knew the solutions to unblock them!

Even though my testing setup is somewhat different from syzbot. I don’t do
fuzzers, and my config is only focus on mm, iommu and a few core kernel pieces
with more debugging options on, but it does bare-metal and multi-arch, there are
still lots of opportunities to help each other with dealing with blocking issues.

A few things I am doing differently with syzbot on linux-next where would help to
be run continuous without blocking most of the time are,

I don’t set panic_on_warn. I’ll deal with warnings afterwards.

Occasionally, there are hard failures that I have to deal with right now. I’ll get to
the end of it, and figured out the exact commit caused it.

In syzbot mode, the bisection (by robot) is the hard part, because if you don’t
figure out the exact commit, most of times people (CC by the bug reports) would
have no clue and will be ignored. (even if the bad commit was figured out, it is
not 100% guaranteed developers would know what’s going on but it helps
dramatically, and at least we can revert it without blocking if everything else fails).

Thus, it would be really help if syzbot (or human operators) could help bisect, even
if it could only figure out one of merge commit in linux-next is bad (where with high
accuracy) and may get those ignored less.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-10 14:26                   ` Qian Cai
@ 2020-04-10 17:26                     ` Andrew Morton
  2020-04-10 19:46                       ` Qian Cai
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2020-04-10 17:26 UTC (permalink / raw)
  To: Qian Cai
  Cc: Tetsuo Handa, Dmitry Vyukov, Linus Torvalds, Stephen Rothwell,
	Peter Xu, LKML, Linux-MM

On Fri, 10 Apr 2020 10:26:23 -0400 Qian Cai <cai@lca.pw> wrote:

> I don't set panic_on_warn. I'll deal with warnings afterwards.

I'm not understanding why sysbot sets panic_on_warn.  This decision
will needlessly turn many kernel errors into wont-boot situations and
will block further testing?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-10 17:26                     ` Andrew Morton
@ 2020-04-10 19:46                       ` Qian Cai
  0 siblings, 0 replies; 37+ messages in thread
From: Qian Cai @ 2020-04-10 19:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, Dmitry Vyukov, Linus Torvalds, Stephen Rothwell,
	Peter Xu, LKML, Linux-MM

> On Apr 10, 2020, at 1:26 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Fri, 10 Apr 2020 10:26:23 -0400 Qian Cai <cai@lca.pw> wrote:
> 
>> I don't set panic_on_warn. I'll deal with warnings afterwards.
> 
> I'm not understanding why sysbot sets panic_on_warn.  This decision
> will needlessly turn many kernel errors into wont-boot situations and
> will block further testing?

I can feel that it is very reasonable to set panic_on_warn for the fully
automatic systems because once those warnings happen, the rest of
things can no longer to be trusted, so the goal to kill the first enemy
on sight, and then deal the next one.

It could be a good idea for some trees more stable like the mainline,
but for linux-next, I could only dream of set it one day.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 16:32     ` Linus Torvalds
  2020-04-09 16:58       ` Qian Cai
@ 2020-04-09 23:29       ` Stephen Rothwell
  2020-04-13 22:06         ` Qian Cai
  1 sibling, 1 reply; 37+ messages in thread
From: Stephen Rothwell @ 2020-04-09 23:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Dmitry Vyukov, Andrew Morton, Peter Xu, LKML, Linux-MM

[-- Attachment #1: Type: text/plain, Size: 1535 bytes --]

Hi Linus,

On Thu, 9 Apr 2020 09:32:32 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, Apr 9, 2020 at 5:55 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > linux-next is boot-broken for more than a month and bugs are piling
> > onto bugs, I've seen at least 3 different ones.
> > syzbot can't get any working linux-next build for testing for a very
> > long time now.  
> 
> Ouch.
> 
> Ok, that's not good. It means that linux-next has basically only done
> build-testing this whole cycle.

Well, there are other CI's beyond syzbot .. Does syzbot only build/test
a single kernel arch/config?

> Stephen, Dmitry - is there some way linux-next could possibly kick out
> trees more aggressively if syzbot can't even boot?

Of course that could be done if I knew that there were problems.  From
memory and my mail archives, I was only cc'd on 3 problems by sysbot
since last November and they were all responded to by the appropriate
maintainers/developers.

Currently, when I am cc'd on reports, if they are also sent to who
seem like the appropriate people, I just file the report assuming it
will be dealt with.

> Kicking trees out of linux-next and making noise if they cause syzbot
> failures might also make some maintainers react more..

That may be true, but in some cases I have carried fixups/reverts/older
versions of trees for quite some time before things get fixed.  But at
least if that happens, I do tend to remind people.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-09 23:29       ` Stephen Rothwell
@ 2020-04-13 22:06         ` Qian Cai
  2020-04-13 23:05           ` Jens Axboe
  2020-04-14 11:12           ` Dmitry Vyukov
  0 siblings, 2 replies; 37+ messages in thread
From: Qian Cai @ 2020-04-13 22:06 UTC (permalink / raw)
  To: Linus Torvalds, Stephen Rothwell, Andrew Morton
  Cc: Dmitry Vyukov, Peter Xu, LKML, Linux-MM, Jens Axboe,
	Christoph Lameter, Johannes Weiner



> On Apr 9, 2020, at 7:29 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> Hi Linus,
> 
> On Thu, 9 Apr 2020 09:32:32 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> 
>> On Thu, Apr 9, 2020 at 5:55 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>>> 
>>> linux-next is boot-broken for more than a month and bugs are piling
>>> onto bugs, I've seen at least 3 different ones.
>>> syzbot can't get any working linux-next build for testing for a very
>>> long time now.  
>> 
>> Ouch.
>> 
>> Ok, that's not good. It means that linux-next has basically only done
>> build-testing this whole cycle.
> 
> Well, there are other CI's beyond syzbot .. Does syzbot only build/test
> a single kernel arch/config?
> 
>> Stephen, Dmitry - is there some way linux-next could possibly kick out
>> trees more aggressively if syzbot can't even boot?
> 
> Of course that could be done if I knew that there were problems.  From
> memory and my mail archives, I was only cc'd on 3 problems by sysbot
> since last November and they were all responded to by the appropriate
> maintainers/developers.
> 
> Currently, when I am cc'd on reports, if they are also sent to who
> seem like the appropriate people, I just file the report assuming it
> will be dealt with.
> 
>> Kicking trees out of linux-next and making noise if they cause syzbot
>> failures might also make some maintainers react more..
> 
> That may be true, but in some cases I have carried fixups/reverts/older
> versions of trees for quite some time before things get fixed.  But at
> least if that happens, I do tend to remind people.

BTW, I’ll be adding fuzzers to my daily linux-next routines where it triggers this
io_uring/scheduler bug almost immediately, so hopefully would buy syzbot some
time to resume on linux-next.

[67493.516737][T211750] BUG: unable to handle page fault for address: ffffffffffffffe8
[67493.557315][T211750] #PF: supervisor read access in kernel mode
[67493.586726][T211750] #PF: error_code(0x0000) - not-present page
[67493.614434][T211750] PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0 
[67493.644846][T211750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[67493.674127][T211750] CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: G    B        L    5.7.0-rc1-next-20200413 #4
[67493.722516][T211750] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/12/2017
[67493.764925][T211750] RIP: 0010:__wake_up_common+0x98/0x290
__wake_up_common at kernel/sched/wait.c:87
[67493.790675][T211750] Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
[67493.881650][T211750] RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
[67493.909854][T211750] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
[67493.947131][T211750] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
[67493.983829][T211750] RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
[67494.020861][T211750] R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
[67494.059249][T211750] R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
[67494.099699][T211750] FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
[67494.141858][T211750] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67494.172660][T211750] CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
[67494.209760][T211750] Call Trace:
[67494.224720][T211750]  __wake_up_common_lock+0xea/0x150
(inlined by) __wake_up_common_lock at kernel/sched/wait.c:124
[67494.248753][T211750]  ? __wake_up_common+0x290/0x290
[67494.272014][T211750]  ? lockdep_hardirqs_on+0x16/0x2c0
[67494.296139][T211750]  __wake_up+0x13/0x20
[67494.314946][T211750]  io_cqring_ev_posted+0x75/0xe0
(inlined by) io_cqring_ev_posted at fs/io_uring.c:1160
[67494.337726][T211750]  io_ring_ctx_wait_and_kill+0x1c0/0x2f0
io_ring_ctx_wait_and_kill at fs/io_uring.c:7305
[67494.363840][T211750]  io_uring_create+0xa8d/0x13b0
[67494.386526][T211750]  ? io_req_defer_prep+0x990/0x990
[67494.410119][T211750]  ? __kasan_check_write+0x14/0x20
[67494.433646][T211750]  io_uring_setup+0xb8/0x130
[67494.454870][T211750]  ? io_uring_create+0x13b0/0x13b0
[67494.478342][T211750]  ? check_flags.part.28+0x220/0x220
[67494.502947][T211750]  ? lockdep_hardirqs_on+0x16/0x2c0
[67494.526965][T211750]  __x64_sys_io_uring_setup+0x31/0x40
[67494.551820][T211750]  do_syscall_64+0xcc/0xaf0
[67494.574829][T211750]  ? syscall_return_slowpath+0x580/0x580
[67494.604591][T211750]  ? lockdep_hardirqs_off+0x1f/0x140
[67494.628901][T211750]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
[67494.657616][T211750]  ? trace_hardirqs_off_caller+0x3a/0x150
[67494.683999][T211750]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[67494.709982][T211750]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[67494.737167][T211750] RIP: 0033:0x7fdcb9dd76ed
[67494.757698][T211750] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48
[67494.849485][T211750] RSP: 002b:00007ffe7fd4e4f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
[67494.887906][T211750] RAX: ffffffffffffffda RBX: 00000000000001a9 RCX: 00007fdcb9dd76ed
[67494.924754][T211750] RDX: fffffffffffffffc RSI: 0000000000000000 RDI: 0000000000005d54
[67494.961516][T211750] RBP: 00000000000001a9 R08: 0000000e31d3caa7 R09: 0082400004004000
[67494.998485][T211750] R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002
[67495.035510][T211750] R13: 00007fdcb842e058 R14: 00007fdcba4c46c0 R15: 00007fdcb842e000
[67495.072679][T211750] Modules linked in: bridge stp llc nfnetlink cn brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_intel kvm irqbypass intel_cstate intel_uncore dax_pmem intel_rapl_perf dax_pmem_core ip_tables x_tables xfs sd_mod tg3 firmware_class libphy hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: binfmt_misc]
[67495.221120][T211750] CR2: ffffffffffffffe8
[67495.240237][T211750] ---[ end trace f9502383d57e0e22 ]---
[67495.265301][T211750] RIP: 0010:__wake_up_common+0x98/0x290
[67495.290903][T211750] Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
[67495.382302][T211750] RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
[67495.410551][T211750] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
[67495.447570][T211750] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
[67495.484252][T211750] RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
[67495.521068][T211750] R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
[67495.557461][T211750] R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
[67495.594607][T211750] FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
[67495.639332][T211750] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67495.669033][T211750] CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
[67495.704569][T211750] Kernel panic - not syncing: Fatal exception
[67495.731758][T211750] Kernel Offset: 0x29800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[67495.784988][T211750] ---[ end Kernel panic - not syncing: Fatal exception ]—

Also, I’ll need to deal with this slub sysfs and memcg lockdep splat first, so lockdep would still be functioning during the fuzzing.

[ 8137.254287][T53013] WARNING: possible circular locking dependency detected
[ 8137.261231][T53013] 5.7.0-rc1-next-20200413+ #2 Not tainted
[ 8137.266981][T53013] ------------------------------------------------------
[ 8137.274016][T53013] trinity-c10/53013 is trying to acquire lock:
[ 8137.280127][T53013] ffffffff89ad2968 (slab_mutex){+.+.}-{3:3}, at: slab_attr_store+0x79/0xf0
[ 8137.288660][T53013] 
[ 8137.288660][T53013] but task is already holding lock:
[ 8137.295944][T53013] ffff9ea421940dd8 (kn->count#88){++++}-{0:0}, at: kernfs_fop_write+0x10e/0x280
[ 8137.305004][T53013] 
[ 8137.305004][T53013] which lock already depends on the new lock.
[ 8137.305004][T53013] 
[ 8137.315347][T53013] 
[ 8137.315347][T53013] the existing dependency chain (in reverse order) is:
[ 8137.324470][T53013] 
[ 8137.324470][T53013] -> #1 (kn->count#88){++++}-{0:0}:
[ 8137.331774][T53013]        __kernfs_remove+0x3bb/0x420
[ 8137.336977][T53013]        kernfs_remove+0x2c/0x40
[ 8137.341829][T53013]        sysfs_remove_dir+0x7e/0x90
[ 8137.346988][T53013]        kobject_del+0x60/0xb0
[ 8137.351663][T53013]        sysfs_slab_unlink+0x1c/0x20
[ 8137.356853][T53013]        shutdown_cache+0x155/0x1c0
[ 8137.361958][T53013]        kmemcg_cache_shutdown_fn+0xe/0x20
[ 8137.367756][T53013]        kmemcg_workfn+0x35/0x50
[ 8137.372705][T53013]        process_one_work+0x560/0xba0
[ 8137.377985][T53013]        worker_thread+0x80/0x5f0
[ 8137.382920][T53013]        kthread+0x1db/0x200
[ 8137.387467][T53013]        ret_from_fork+0x27/0x50
[ 8137.392296][T53013] 
[ 8137.392296][T53013] -> #0 (slab_mutex){+.+.}-{3:3}:
[ 8137.399587][T53013]        __lock_acquire+0x1673/0x23a0
[ 8137.404865][T53013]        lock_acquire+0xcd/0x410
[ 8137.409710][T53013]        __mutex_lock+0xc9/0xbf0
[ 8137.414561][T53013]        mutex_lock_nested+0x31/0x40
[ 8137.419823][T53013]        slab_attr_store+0x79/0xf0
[ 8137.424991][T53013]        sysfs_kf_write+0x9a/0xb0
[ 8137.429919][T53013]        kernfs_fop_write+0x15e/0x280
[ 8137.435272][T53013]        do_iter_write+0x261/0x2c0
[ 8137.440291][T53013]        vfs_writev+0xe6/0x170
[ 8137.444995][T53013]        do_pwritev+0xab/0xd0
[ 8137.449731][T53013]        __x64_sys_pwritev+0x61/0x80
[ 8137.455079][T53013]        do_syscall_64+0x91/0xb10
[ 8137.460010][T53013]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 8137.466330][T53013] 
[ 8137.466330][T53013] other info that might help us debug this:
[ 8137.466330][T53013] 
[ 8137.476695][T53013]  Possible unsafe locking scenario:
[ 8137.476695][T53013] 
[ 8137.484067][T53013]        CPU0                    CPU1
[ 8137.489337][T53013]        ----                    ----
[ 8137.494609][T53013]   lock(kn->count#88);
[ 8137.498718][T53013]                                lock(slab_mutex);
[ 8137.505130][T53013]                                lock(kn->count#88);
[ 8137.511718][T53013]   lock(slab_mutex);
[ 8137.515594][T53013] 
[ 8137.515594][T53013]  *** DEADLOCK ***
[ 8137.515594][T53013] 
[ 8137.523780][T53013] 3 locks held by trinity-c10/53013:
[ 8137.528967][T53013]  #0: ffff9ea42c841430 (sb_writers#3){.+.+}-{0:0}, at: vfs_writev+0x13f/0x170
[ 8137.537847][T53013]  #1: ffff9ea3c2c9c288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write+0xfe/0x280
[ 8137.547149][T53013]  #2: ffff9ea421940dd8 (kn->count#88){++++}-{0:0}, at: kernfs_fop_write+0x10e/0x280
[ 8137.556554][T53013] 
[ 8137.556554][T53013] stack backtrace:
[ 8137.562360][T53013] CPU: 22 PID: 53013 Comm: trinity-c10 Not tainted 5.7.0-rc1-next-20200413+ #2
[ 8137.571216][T53013] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 03/09/2018
[ 8137.580670][T53013] Call Trace:
[ 8137.583865][T53013]  dump_stack+0xa4/0x100
[ 8137.588105][T53013]  print_circular_bug.cold.49+0x13c/0x141
[ 8137.593746][T53013]  check_noncircular+0x183/0x1a0
[ 8137.598818][T53013]  ? do_raw_spin_unlock+0x10b/0x1c0
[ 8137.603971][T53013]  __lock_acquire+0x1673/0x23a0
[ 8137.608732][T53013]  ? print_unreferenced+0x224/0x230
[ 8137.613846][T53013]  lock_acquire+0xcd/0x410
[ 8137.618169][T53013]  ? slab_attr_store+0x79/0xf0
[ 8137.619804][T52801] VFS: "mand" mount option not supported
[ 8137.623085][T53013]  __mutex_lock+0xc9/0xbf0
[ 8137.632939][T53013]  ? slab_attr_store+0x79/0xf0
[ 8137.637679][T53013]  ? _find_next_bit.constprop.1+0xd0/0x100
[ 8137.643477][T53013]  ? slab_attr_store+0x79/0xf0
[ 8137.648252][T53013]  ? find_next_bit+0x36/0x40
[ 8137.652748][T53013]  ? cpumask_next+0x46/0x60
[ 8137.657155][T53013]  mutex_lock_nested+0x31/0x40
[ 8137.661897][T53013]  ? slab_attr_show+0x30/0x30
[ 8137.666476][T53013]  ? mutex_lock_nested+0x31/0x40
[ 8137.671373][T53013]  slab_attr_store+0x79/0xf0
[ 8137.676035][T53013]  ? slab_attr_show+0x30/0x30
[ 8137.680672][T53013]  sysfs_kf_write+0x9a/0xb0
[ 8137.685085][T53013]  ? sysfs_file_ops+0xb0/0xb0
[ 8137.689673][T53013]  kernfs_fop_write+0x15e/0x280
[ 8137.694433][T53013]  do_iter_write+0x261/0x2c0
[ 8137.698997][T53013]  vfs_writev+0xe6/0x170
[ 8137.703142][T53013]  ? lock_acquire+0xcd/0x410
[ 8137.707696][T53013]  ? find_held_lock+0x35/0xa0
[ 8137.712279][T53013]  ? __fget_light+0xa3/0x170
[ 8137.716897][T53013]  do_pwritev+0xab/0xd0
[ 8137.720960][T53013]  __x64_sys_pwritev+0x61/0x80
[ 8137.725808][T53013]  do_syscall_64+0x91/0xb10
[ 8137.730274][T53013]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 8137.735733][T53013]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
[ 8137.741588][T53013] RIP: 0033:0x7f54d0a386ed

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-13 22:06         ` Qian Cai
@ 2020-04-13 23:05           ` Jens Axboe
  2020-04-14 11:12           ` Dmitry Vyukov
  1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2020-04-13 23:05 UTC (permalink / raw)
  To: Qian Cai, Linus Torvalds, Stephen Rothwell, Andrew Morton
  Cc: Dmitry Vyukov, Peter Xu, LKML, Linux-MM, Christoph Lameter,
	Johannes Weiner

On 4/13/20 4:06 PM, Qian Cai wrote:
> 
> 
>> On Apr 9, 2020, at 7:29 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>>
>> Hi Linus,
>>
>> On Thu, 9 Apr 2020 09:32:32 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>> On Thu, Apr 9, 2020 at 5:55 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>>>>
>>>> linux-next is boot-broken for more than a month and bugs are piling
>>>> onto bugs, I've seen at least 3 different ones.
>>>> syzbot can't get any working linux-next build for testing for a very
>>>> long time now.  
>>>
>>> Ouch.
>>>
>>> Ok, that's not good. It means that linux-next has basically only done
>>> build-testing this whole cycle.
>>
>> Well, there are other CI's beyond syzbot .. Does syzbot only build/test
>> a single kernel arch/config?
>>
>>> Stephen, Dmitry - is there some way linux-next could possibly kick out
>>> trees more aggressively if syzbot can't even boot?
>>
>> Of course that could be done if I knew that there were problems.  From
>> memory and my mail archives, I was only cc'd on 3 problems by sysbot
>> since last November and they were all responded to by the appropriate
>> maintainers/developers.
>>
>> Currently, when I am cc'd on reports, if they are also sent to who
>> seem like the appropriate people, I just file the report assuming it
>> will be dealt with.
>>
>>> Kicking trees out of linux-next and making noise if they cause syzbot
>>> failures might also make some maintainers react more..
>>
>> That may be true, but in some cases I have carried fixups/reverts/older
>> versions of trees for quite some time before things get fixed.  But at
>> least if that happens, I do tend to remind people.
> 
> BTW, I’ll be adding fuzzers to my daily linux-next routines where it
> triggers this io_uring/scheduler bug almost immediately, so hopefully
> would buy syzbot some time to resume on linux-next.
> 
> [67493.516737][T211750] BUG: unable to handle page fault for address: ffffffffffffffe8
> [67493.557315][T211750] #PF: supervisor read access in kernel mode
> [67493.586726][T211750] #PF: error_code(0x0000) - not-present page
> [67493.614434][T211750] PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0 
> [67493.644846][T211750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> [67493.674127][T211750] CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: G    B        L    5.7.0-rc1-next-20200413 #4
> [67493.722516][T211750] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/12/2017
> [67493.764925][T211750] RIP: 0010:__wake_up_common+0x98/0x290
> __wake_up_common at kernel/sched/wait.c:87
> [67493.790675][T211750] Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
> [67493.881650][T211750] RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
> [67493.909854][T211750] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
> [67493.947131][T211750] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
> [67493.983829][T211750] RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
> [67494.020861][T211750] R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
> [67494.059249][T211750] R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
> [67494.099699][T211750] FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
> [67494.141858][T211750] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [67494.172660][T211750] CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
> [67494.209760][T211750] Call Trace:
> [67494.224720][T211750]  __wake_up_common_lock+0xea/0x150
> (inlined by) __wake_up_common_lock at kernel/sched/wait.c:124
> [67494.248753][T211750]  ? __wake_up_common+0x290/0x290
> [67494.272014][T211750]  ? lockdep_hardirqs_on+0x16/0x2c0
> [67494.296139][T211750]  __wake_up+0x13/0x20
> [67494.314946][T211750]  io_cqring_ev_posted+0x75/0xe0
> (inlined by) io_cqring_ev_posted at fs/io_uring.c:1160
> [67494.337726][T211750]  io_ring_ctx_wait_and_kill+0x1c0/0x2f0
> io_ring_ctx_wait_and_kill at fs/io_uring.c:7305
> [67494.363840][T211750]  io_uring_create+0xa8d/0x13b0
> [67494.386526][T211750]  ? io_req_defer_prep+0x990/0x990
> [67494.410119][T211750]  ? __kasan_check_write+0x14/0x20
> [67494.433646][T211750]  io_uring_setup+0xb8/0x130
> [67494.454870][T211750]  ? io_uring_create+0x13b0/0x13b0
> [67494.478342][T211750]  ? check_flags.part.28+0x220/0x220
> [67494.502947][T211750]  ? lockdep_hardirqs_on+0x16/0x2c0
> [67494.526965][T211750]  __x64_sys_io_uring_setup+0x31/0x40
> [67494.551820][T211750]  do_syscall_64+0xcc/0xaf0
> [67494.574829][T211750]  ? syscall_return_slowpath+0x580/0x580
> [67494.604591][T211750]  ? lockdep_hardirqs_off+0x1f/0x140
> [67494.628901][T211750]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3
> [67494.657616][T211750]  ? trace_hardirqs_off_caller+0x3a/0x150
> [67494.683999][T211750]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [67494.709982][T211750]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [67494.737167][T211750] RIP: 0033:0x7fdcb9dd76ed
> [67494.757698][T211750] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48
> [67494.849485][T211750] RSP: 002b:00007ffe7fd4e4f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
> [67494.887906][T211750] RAX: ffffffffffffffda RBX: 00000000000001a9 RCX: 00007fdcb9dd76ed
> [67494.924754][T211750] RDX: fffffffffffffffc RSI: 0000000000000000 RDI: 0000000000005d54
> [67494.961516][T211750] RBP: 00000000000001a9 R08: 0000000e31d3caa7 R09: 0082400004004000
> [67494.998485][T211750] R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002
> [67495.035510][T211750] R13: 00007fdcb842e058 R14: 00007fdcba4c46c0 R15: 00007fdcb842e000
> [67495.072679][T211750] Modules linked in: bridge stp llc nfnetlink cn brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_intel kvm irqbypass intel_cstate intel_uncore dax_pmem intel_rapl_perf dax_pmem_core ip_tables x_tables xfs sd_mod tg3 firmware_class libphy hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: binfmt_misc]
> [67495.221120][T211750] CR2: ffffffffffffffe8
> [67495.240237][T211750] ---[ end trace f9502383d57e0e22 ]---
> [67495.265301][T211750] RIP: 0010:__wake_up_common+0x98/0x290
> [67495.290903][T211750] Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10
> [67495.382302][T211750] RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046
> [67495.410551][T211750] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8
> [67495.447570][T211750] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8
> [67495.484252][T211750] RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd
> [67495.521068][T211750] R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000
> [67495.557461][T211750] R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8
> [67495.594607][T211750] FS:  00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000
> [67495.639332][T211750] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [67495.669033][T211750] CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0
> [67495.704569][T211750] Kernel panic - not syncing: Fatal exception
> [67495.731758][T211750] Kernel Offset: 0x29800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [67495.784988][T211750] ---[ end Kernel panic - not syncing: Fatal exception ]—

This looks like it can happen if you fail setting up the ring (you
probably have some error injection enabled), and then
io_poll_remove_all() does an unconditional io_cqring_ev_posted() even if
we didn't post any events.

I think the best fix here is to ensure io_cqring_ev_posted() is only
called if we actually post events. If we did, we know for a fact that
rings have been setup.

I'll queue this up for 5.7.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5190bfb6a665..c0aa72e738b4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4342,7 +4342,7 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx)
 {
 	struct hlist_node *tmp;
 	struct io_kiocb *req;
-	int i;
+	int posted = 0, i;
 
 	spin_lock_irq(&ctx->completion_lock);
 	for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) {
@@ -4350,11 +4350,12 @@ static void io_poll_remove_all(struct io_ring_ctx *ctx)
 
 		list = &ctx->cancel_hash[i];
 		hlist_for_each_entry_safe(req, tmp, list, hash_node)
-			io_poll_remove_one(req);
+			posted += io_poll_remove_one(req);
 	}
 	spin_unlock_irq(&ctx->completion_lock);
 
-	io_cqring_ev_posted(ctx);
+	if (posted)
+		io_cqring_ev_posted(ctx);
 }
 
 static int io_poll_cancel(struct io_ring_ctx *ctx, __u64 sqe_addr)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-13 22:06         ` Qian Cai
  2020-04-13 23:05           ` Jens Axboe
@ 2020-04-14 11:12           ` Dmitry Vyukov
  2020-04-14 11:59             ` Qian Cai
                               ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Dmitry Vyukov @ 2020-04-14 11:12 UTC (permalink / raw)
  To: Qian Cai
  Cc: Linus Torvalds, Stephen Rothwell, Andrew Morton, Peter Xu, LKML,
	Linux-MM, Jens Axboe, Christoph Lameter, Johannes Weiner,
	syzkaller, Dan Rue

On Tue, Apr 14, 2020 at 12:06 AM Qian Cai <cai@lca.pw> wrote:
> > On Apr 9, 2020, at 7:29 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > Hi Linus,
> >
> > On Thu, 9 Apr 2020 09:32:32 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >>
> >> On Thu, Apr 9, 2020 at 5:55 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> >>>
> >>> linux-next is boot-broken for more than a month and bugs are piling
> >>> onto bugs, I've seen at least 3 different ones.
> >>> syzbot can't get any working linux-next build for testing for a very
> >>> long time now.
> >>
> >> Ouch.
> >>
> >> Ok, that's not good. It means that linux-next has basically only done
> >> build-testing this whole cycle.
> >
> > Well, there are other CI's beyond syzbot .. Does syzbot only build/test
> > a single kernel arch/config?
> >
> >> Stephen, Dmitry - is there some way linux-next could possibly kick out
> >> trees more aggressively if syzbot can't even boot?

Hello all,

Sorry for corona/holiday-delays. I will try to answer/comment on all
things in this thread in this email.

AI: we need to CC linux-next@ on linux-next build/boot failures. I
will work on this.
We have functionality to CC given emails on _all_ bugs on the given
tree, but we don't have this for build/boot bugs only. I will try to
add this soon.
Stephen, do you want to be CCed as well? Or just linux-next@?

> So old bugs generally should be aged out

This actually happens now.
Bugs without reproducers are auto-closed after 60-120 days since last
occurrence (based on past frequency). And for linux-next the range is
40-60 days.
Bugs with reproducers are not auto-closed. But they are fix bisected
and cause bisected, both of which are only ~66% correct, but still
frequently provide a useful signal. Also bugs with reproducers are
just generally easier to handle.

Another important distinction from Bugzilla is that syzbot dashboard
has up-to-date "Last crash time" information. Click on the "Last"
column here:
https://syzkaller.appspot.com/upstream
It's very easy to ignore everything that happened months ago for
starters, if that's the concern.

So it's not as perfect as it would be with a dedicated human team
attached, but I would say it's now in a reasonable shape with ~400
open bugs that happened within the last month.
And now we have data to confirm that "old" does not mean "irrelevant".
Our leader:
BUG: please report to dccp@vger.kernel.org => prev = 0, last = 0 at
net/dccp/ccids/lib/packet_history.c:LINE/tfrc_rx_hist_sample_rtt()
https://syzkaller.appspot.com/bug?id=0881c535c265ca965edc49c0ac3d0a9850d26eb1
was first triggered 964 days ago, but pretty much still there all that time.

> It would be nice if there was some way we could triage Syzkaller
> bugs into different buckets.

Though, yes, I am afraid of stepping onto the slippery slope of
implementing a full-fledged bug tracking system, I think syzbot will
gather more bug tracker features and tags will happen. We still have
https://github.com/google/syzkaller/issues/608 open and it's mainly
the question of allocating resources for implementation and figuring
out the actual tags hierarchy.
For login and credentials, I guess we will go with just "whoever can
send emails is a root" because we are doing this already anyways
(closing a bug is more critical than changing a tag) :)

Re panic_on_warn.
We don't have a dedicated engineer to sheriff and give manual
consideration and judgement to each case. And as Qian noted, in such
circumstances it's reasonable to don't trust anything after a warning.
Some notorious examples: LOCKDEP warnings disable LOCKDEP; so if we
boot in such state with eyes closed and then try to do fuzzing, or
"better" test a patch for a LOCKDEP error, or do bisection of a
LOCKDEP error, we will immediately give bogus testing results or
bisection culprit.
Or a warning about hung task may re-appear later during testing and
confuse results again.
Or if we ignore KASAN warning, we boot potentially with corrupted
memory with who-knows-what consequences.
A "normal" WARNING may be benign (misuse of WARNING), or maybe not.
Impossible to figure out automatically. And in the end, if we ignore
that, who/when will notice and fix that?
We get this far with this black-and-white criteria for kernel bugs. I
think it had some positive effects on a number of areas, as we go
forward I think it's better to extend panic_on_warn to more testing
systems. Then non-fatal bugs will be no different from fatal bugs
during boot, which we need to handle in a reasonable timeframe anyway.
Which gets me to the next "interesting" point.

> Well, there are other CI's beyond syzbot.
> On the other hand, this makes me worry who is testing on linux-next every day.

How do these use-after-free's and locking bugs get past the
unit-testing systems (which syzbot is not) and remain unnoticed for so
long?...
syzbot uses the dumbest VMs (GCE), so everything it triggers during
boot should be triggerable pretty much everywhere.
It seems to be an action point for the testing systems. "Boot to ssh"
is not the best criteria. Again if there is a LOCKDEP error, we are
not catching any more LOCKDEP errors during subsequent testing. If
there is a use-after-free, that's a serious error on its own and KASAN
produces only 1 error by default as well. And as far as I understand,
lots of kernel testing systems don't even enable KASAN, which is very
wrong.
I've talked to +Dan Rue re this few days ago. Hopefully LKFT will
start catching these as part of unit testing. Which should help with
syzbot testing as well.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-14 11:12           ` Dmitry Vyukov
@ 2020-04-14 11:59             ` Qian Cai
  2020-04-14 12:05               ` Dmitry Vyukov
  2020-04-14 19:28             ` Dan Rue
  2020-04-16  0:34             ` Stephen Rothwell
  2 siblings, 1 reply; 37+ messages in thread
From: Qian Cai @ 2020-04-14 11:59 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Linus Torvalds, Stephen Rothwell, Andrew Morton, Peter Xu, LKML,
	Linux-MM, Jens Axboe, Christoph Lameter, Johannes Weiner,
	syzkaller, Dan Rue

> On Apr 14, 2020, at 7:13 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> 
> How do these use-after-free's and locking bugs get past the
> unit-testing systems (which syzbot is not) and remain unnoticed for so
> long?...
> syzbot uses the dumbest VMs (GCE), so everything it triggers during
> boot should be triggerable pretty much everywhere.

There are many reasons that any early testing would not be able to catch ALL the syzbot blockers.

The Kconfigs are different. For example, I don’t have openvswitch enabled, so would miss that ovs rcu-list lockdep warning. Same for that use-after-free in net/bluetooth and a warning in sound subsystem.

But, notifying Linux-next ML is a good start, so at least we could ask Paul or Steve to pull out the commit which enabling rcu-list debugging by default with PROVE_RCU.

I learned through that restricted kconfig to some degree of minimal could save a lot of troubles late on especially those options that I have no way to exercise like net/bluetooth and sound currently. It is going to be extra works though because those default options in Linux-next or even defconfigs are not always pleasant and would want to enable something I don’t need if not given human intervention.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-14 11:59             ` Qian Cai
@ 2020-04-14 12:05               ` Dmitry Vyukov
  0 siblings, 0 replies; 37+ messages in thread
From: Dmitry Vyukov @ 2020-04-14 12:05 UTC (permalink / raw)
  To: Qian Cai
  Cc: Linus Torvalds, Stephen Rothwell, Andrew Morton, Peter Xu, LKML,
	Linux-MM, Jens Axboe, Christoph Lameter, Johannes Weiner,
	syzkaller, Dan Rue

On Tue, Apr 14, 2020 at 1:59 PM Qian Cai <cai@lca.pw> wrote:
> > On Apr 14, 2020, at 7:13 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > How do these use-after-free's and locking bugs get past the
> > unit-testing systems (which syzbot is not) and remain unnoticed for so
> > long?...
> > syzbot uses the dumbest VMs (GCE), so everything it triggers during
> > boot should be triggerable pretty much everywhere.
>
> There are many reasons that any early testing would not be able to catch ALL the syzbot blockers.
>
> The Kconfigs are different. For example, I don’t have openvswitch enabled, so would miss that ovs rcu-list lockdep warning. Same for that use-after-free in net/bluetooth and a warning in sound subsystem.
>
> But, notifying Linux-next ML is a good start, so at least we could ask Paul or Steve to pull out the commit which enabling rcu-list debugging by default with PROVE_RCU.
>
> I learned through that restricted kconfig to some degree of minimal could save a lot of troubles late on especially those options that I have no way to exercise like net/bluetooth and sound currently. It is going to be extra works though because those default options in Linux-next or even defconfigs are not always pleasant and would want to enable something I don’t need if not given human intervention.

We only try to enable what we can reach. There is significant reach
for sound and net/bluetooth even without any hardware. So I would
assume generic testing systems like KernelCI, LKFT, CKI should enable
these as well. Hopefully we don't have all of the sound and
net/bluetooth completely untested in linux-next.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-14 11:12           ` Dmitry Vyukov
  2020-04-14 11:59             ` Qian Cai
@ 2020-04-14 19:28             ` Dan Rue
  2020-04-15 11:09               ` Dmitry Vyukov
  2020-04-16  0:34             ` Stephen Rothwell
  2 siblings, 1 reply; 37+ messages in thread
From: Dan Rue @ 2020-04-14 19:28 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Qian Cai, Linus Torvalds, Stephen Rothwell, Andrew Morton,
	Peter Xu, LKML, Linux-MM, Jens Axboe, Christoph Lameter,
	Johannes Weiner, syzkaller

On Tue, Apr 14, 2020 at 01:12:50PM +0200, Dmitry Vyukov wrote:
> On Tue, Apr 14, 2020 at 12:06 AM Qian Cai <cai@lca.pw> wrote:
> > Well, there are other CI's beyond syzbot.
> > On the other hand, this makes me worry who is testing on linux-next every day.
> 
> How do these use-after-free's and locking bugs get past the
> unit-testing systems (which syzbot is not) and remain unnoticed for so
> long?...
> syzbot uses the dumbest VMs (GCE), so everything it triggers during
> boot should be triggerable pretty much everywhere.
> It seems to be an action point for the testing systems. "Boot to ssh"
> is not the best criteria. Again if there is a LOCKDEP error, we are
> not catching any more LOCKDEP errors during subsequent testing. If
> there is a use-after-free, that's a serious error on its own and KASAN
> produces only 1 error by default as well. And as far as I understand,
> lots of kernel testing systems don't even enable KASAN, which is very
> wrong.
> I've talked to +Dan Rue re this few days ago. Hopefully LKFT will
> start catching these as part of unit testing. Which should help with
> syzbot testing as well.

LKFT has recently added testing with KASAN enabled and improved the
kernel log parsing to catch more of this class of errors while
performing our regular functional testing.

Incidentally, -next was also broken for us from March 25 through April 5
due to a perf build failure[0], which eventually made itself all the way
down into v5.6 release and I believe the first two 5.6.x stable
releases.

For -next, LKFT's gap is primarily reporting. We do build and run over
30k tests on every -next daily release, but we send out issues manually
when we see them because triaging is still a manual effort. We're
working to build better automated reporting. If anyone is interested in
watching LKFT's -next results more closely (warning, it's a bit noisy),
please let me know. Watching the results at https://lkft.linaro.org
provides some overall health indications, but again, it gets pretty
difficult to figure out signal from noise once you start drilling down
without sufficient context of the system.

Dan

[0] https://lore.kernel.org/stable/CA+G9fYsZjmf34pQT1DeLN_DDwvxCWEkbzBfF0q2VERHb25dfZQ@mail.gmail.com/

-- 
Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-14 19:28             ` Dan Rue
@ 2020-04-15 11:09               ` Dmitry Vyukov
  2020-04-15 16:23                 ` Dan Rue
  0 siblings, 1 reply; 37+ messages in thread
From: Dmitry Vyukov @ 2020-04-15 11:09 UTC (permalink / raw)
  To: Dan Rue
  Cc: Qian Cai, Linus Torvalds, Stephen Rothwell, Andrew Morton,
	Peter Xu, LKML, Linux-MM, Jens Axboe, Christoph Lameter,
	Johannes Weiner, syzkaller

On Tue, Apr 14, 2020 at 9:28 PM Dan Rue <dan.rue@linaro.org> wrote:
>
> On Tue, Apr 14, 2020 at 01:12:50PM +0200, Dmitry Vyukov wrote:
> > On Tue, Apr 14, 2020 at 12:06 AM Qian Cai <cai@lca.pw> wrote:
> > > Well, there are other CI's beyond syzbot.
> > > On the other hand, this makes me worry who is testing on linux-next every day.
> >
> > How do these use-after-free's and locking bugs get past the
> > unit-testing systems (which syzbot is not) and remain unnoticed for so
> > long?...
> > syzbot uses the dumbest VMs (GCE), so everything it triggers during
> > boot should be triggerable pretty much everywhere.
> > It seems to be an action point for the testing systems. "Boot to ssh"
> > is not the best criteria. Again if there is a LOCKDEP error, we are
> > not catching any more LOCKDEP errors during subsequent testing. If
> > there is a use-after-free, that's a serious error on its own and KASAN
> > produces only 1 error by default as well. And as far as I understand,
> > lots of kernel testing systems don't even enable KASAN, which is very
> > wrong.
> > I've talked to +Dan Rue re this few days ago. Hopefully LKFT will
> > start catching these as part of unit testing. Which should help with
> > syzbot testing as well.
>
> LKFT has recently added testing with KASAN enabled and improved the
> kernel log parsing to catch more of this class of errors while
> performing our regular functional testing.
>
> Incidentally, -next was also broken for us from March 25 through April 5
> due to a perf build failure[0], which eventually made itself all the way
> down into v5.6 release and I believe the first two 5.6.x stable
> releases.
>
> For -next, LKFT's gap is primarily reporting. We do build and run over
> 30k tests on every -next daily release, but we send out issues manually
> when we see them because triaging is still a manual effort. We're
> working to build better automated reporting. If anyone is interested in
> watching LKFT's -next results more closely (warning, it's a bit noisy),
> please let me know. Watching the results at https://lkft.linaro.org
> provides some overall health indications, but again, it gets pretty
> difficult to figure out signal from noise once you start drilling down
> without sufficient context of the system.

What kind of failures and noise do you get? Is it flaky tests?
I would assume build failures are ~0% flaky/noisy. And boot failures
are maybe ~1% flaky/noisy due to some infra issues.

I can't find any actual test failure logs in the UI. I've got to this page:
https://qa-reports.linaro.org/lkft/linux-mainline-oe/build/v5.7-rc1-24-g8632e9b5645b/testrun/1363280/suite/kselftest/tests/
which seem to contain failed tests on mainline. But I still can't find
the actual test failure logs.


> Dan
>
> [0] https://lore.kernel.org/stable/CA+G9fYsZjmf34pQT1DeLN_DDwvxCWEkbzBfF0q2VERHb25dfZQ@mail.gmail.com/
>
> --
> Linaro LKFT
> https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-15 11:09               ` Dmitry Vyukov
@ 2020-04-15 16:23                 ` Dan Rue
  0 siblings, 0 replies; 37+ messages in thread
From: Dan Rue @ 2020-04-15 16:23 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Qian Cai, Linus Torvalds, Stephen Rothwell, Andrew Morton,
	Peter Xu, LKML, Linux-MM, Jens Axboe, Christoph Lameter,
	Johannes Weiner, syzkaller

On Wed, Apr 15, 2020 at 01:09:32PM +0200, Dmitry Vyukov wrote:
> On Tue, Apr 14, 2020 at 9:28 PM Dan Rue <dan.rue@linaro.org> wrote:
> >
> > On Tue, Apr 14, 2020 at 01:12:50PM +0200, Dmitry Vyukov wrote:
> > > On Tue, Apr 14, 2020 at 12:06 AM Qian Cai <cai@lca.pw> wrote:
> > > > Well, there are other CI's beyond syzbot.
> > > > On the other hand, this makes me worry who is testing on linux-next every day.
> > >
> > > How do these use-after-free's and locking bugs get past the
> > > unit-testing systems (which syzbot is not) and remain unnoticed for so
> > > long?...
> > > syzbot uses the dumbest VMs (GCE), so everything it triggers during
> > > boot should be triggerable pretty much everywhere.
> > > It seems to be an action point for the testing systems. "Boot to ssh"
> > > is not the best criteria. Again if there is a LOCKDEP error, we are
> > > not catching any more LOCKDEP errors during subsequent testing. If
> > > there is a use-after-free, that's a serious error on its own and KASAN
> > > produces only 1 error by default as well. And as far as I understand,
> > > lots of kernel testing systems don't even enable KASAN, which is very
> > > wrong.
> > > I've talked to +Dan Rue re this few days ago. Hopefully LKFT will
> > > start catching these as part of unit testing. Which should help with
> > > syzbot testing as well.
> >
> > LKFT has recently added testing with KASAN enabled and improved the
> > kernel log parsing to catch more of this class of errors while
> > performing our regular functional testing.
> >
> > Incidentally, -next was also broken for us from March 25 through April 5
> > due to a perf build failure[0], which eventually made itself all the way
> > down into v5.6 release and I believe the first two 5.6.x stable
> > releases.
> >
> > For -next, LKFT's gap is primarily reporting. We do build and run over
> > 30k tests on every -next daily release, but we send out issues manually
> > when we see them because triaging is still a manual effort. We're
> > working to build better automated reporting. If anyone is interested in
> > watching LKFT's -next results more closely (warning, it's a bit noisy),
> > please let me know. Watching the results at https://lkft.linaro.org
> > provides some overall health indications, but again, it gets pretty
> > difficult to figure out signal from noise once you start drilling down
> > without sufficient context of the system.
> 
> What kind of failures and noise do you get? Is it flaky tests?
> I would assume build failures are ~0% flaky/noisy. And boot failures
> are maybe ~1% flaky/noisy due to some infra issues.

Right - infrastructure problems aside (which are the easy part), tests
are quite flaky/noisy.

I guess we're getting quite off topic now, but in LKFT's case we run
tests that are available from the likes of LTP, kselftest, and a variety
of other test suites. Every test was written by a developer with certain
assumptions in place - many of which we violate when we run them on a
small arm board, for example. And many may just be low quality to begin
with, but they often work well enough for the original author's
use-case.

In such cases, we mark them (manually at this point) as a known issue.
For example, here are our kselftest known issues:
https://github.com/Linaro/qa-reports-known-issues/blob/master/kselftests-production.yaml

These lists are quite a chore to keep up to date, and so they tend to
lag reality. What's needed (and what we're working toward) is more
sophisticated analytics on top of our results to determine actual
regressions.

I'll give just one example, randomly selected but typical. Here's a
timer test that sometimes passes and sometimes fails, which compares how
much time something takes with a hard coded value of what the author
expects. Running on small arm hosts or under qemu, the following check
sometimes fails:
https://github.com/torvalds/linux/blob/master/tools/testing/selftests/timers/rtcpie.c#L104-L111

There are _many_ such tests - hundreds or thousands, which rely on hard
coded expectations and are quite hard to "fix". But we run them all
because most of them haven't failed yet, and if they do we'll find out
why.

We ignore the tests which either always fail, or which sometimes fail,
in general. I'm sure there are some legitimate bugs in that set of
failures, but they're probably not "regressions" so just as syzkaller
lets old bugs close automatically, we ignore tests that have a history
of failing.

> 
> I can't find any actual test failure logs in the UI. I've got to this page:
> https://qa-reports.linaro.org/lkft/linux-mainline-oe/build/v5.7-rc1-24-g8632e9b5645b/testrun/1363280/suite/kselftest/tests/
> which seem to contain failed tests on mainline. But I still can't find
> the actual test failure logs.

From the link you gave, if you go up one level to
https://qa-reports.linaro.org/lkft/linux-mainline-oe/build/v5.7-rc1-24-g8632e9b5645b/testrun/1363280/,
you will see links to the "Log File" which takes you to
https://qa-reports.linaro.org/lkft/linux-mainline-oe/build/v5.7-rc1-24-g8632e9b5645b/testrun/1363280/log.

In some test suite cases (perhaps just LTP), we have logs per test. In
most, we just have one large log of the entire run. Even when we have a
log-per-test, it may miss some asynchronous dmesg output I expect,
causing an investigator to look at the whole log anyway.

Dan

> 
> 
> > Dan
> >
> > [0] https://lore.kernel.org/stable/CA+G9fYsZjmf34pQT1DeLN_DDwvxCWEkbzBfF0q2VERHb25dfZQ@mail.gmail.com/
> >
> > --
> > Linaro LKFT
> > https://lkft.linaro.org

-- 
Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-14 11:12           ` Dmitry Vyukov
  2020-04-14 11:59             ` Qian Cai
  2020-04-14 19:28             ` Dan Rue
@ 2020-04-16  0:34             ` Stephen Rothwell
  2020-05-11 15:29               ` Dmitry Vyukov
  2 siblings, 1 reply; 37+ messages in thread
From: Stephen Rothwell @ 2020-04-16  0:34 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Qian Cai, Linus Torvalds, Andrew Morton, Peter Xu, LKML,
	Linux-MM, Jens Axboe, Christoph Lameter, Johannes Weiner,
	syzkaller, Dan Rue

[-- Attachment #1: Type: text/plain, Size: 480 bytes --]

Hi Dmitry,

On Tue, 14 Apr 2020 13:12:50 +0200 Dmitry Vyukov <dvyukov@google.com> wrote:
>
> AI: we need to CC linux-next@ on linux-next build/boot failures. I
> will work on this.
> We have functionality to CC given emails on _all_ bugs on the given
> tree, but we don't have this for build/boot bugs only. I will try to
> add this soon.
> Stephen, do you want to be CCed as well? Or just linux-next@?

Please cc me as well, thanks.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/2] mm: Two small fixes for recent syzbot reports
  2020-04-16  0:34             ` Stephen Rothwell
@ 2020-05-11 15:29               ` Dmitry Vyukov
  0 siblings, 0 replies; 37+ messages in thread
From: Dmitry Vyukov @ 2020-05-11 15:29 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Qian Cai, Linus Torvalds, Andrew Morton, Peter Xu, LKML,
	Linux-MM, Jens Axboe, Christoph Lameter, Johannes Weiner,
	syzkaller, Dan Rue

On Thu, Apr 16, 2020 at 2:34 AM Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Hi Dmitry,
>
> On Tue, 14 Apr 2020 13:12:50 +0200 Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > AI: we need to CC linux-next@ on linux-next build/boot failures. I
> > will work on this.
> > We have functionality to CC given emails on _all_ bugs on the given
> > tree, but we don't have this for build/boot bugs only. I will try to
> > add this soon.
> > Stephen, do you want to be CCed as well? Or just linux-next@?
>
> Please cc me as well, thanks.

To close the loop: I implemented and deployed 2 improvements:

1. Notion of per-repo "build maintainers":
https://github.com/google/syzkaller/commit/88cb3e92ba25303ab67aaceb083fe7304fccd32f
Now linux-next@ and Stephen should be CCed on all linux-next breakages
automatically.

2. Finding maintainers for build errors:
https://github.com/google/syzkaller/commit/65a44e22ba217ef7272b9d3735e9d12cfaa204f6
syzbot will attempt to extract the broken file name from make output
and then run get_matainers.pl on it to find relevant emails.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2020-05-11 15:29 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200414040717.22040-1-hdanton@sina.com>
2020-04-14  4:31 ` [PATCH 0/2] mm: Two small fixes for recent syzbot reports Jens Axboe
2020-04-08  1:40 Peter Xu
2020-04-09  0:47 ` Andrew Morton
2020-04-09 11:49   ` Matthew Wilcox
2020-04-09 13:00     ` Dmitry Vyukov
2020-04-09 18:16       ` Andrew Morton
2020-04-09 18:53         ` Linus Torvalds
2020-04-09 19:12           ` Andrew Morton
2020-04-09 19:46             ` Linus Torvalds
2020-04-09 19:56               ` Matthew Wilcox
2020-04-09 19:58                 ` Linus Torvalds
2020-04-09 20:27                   ` Eric Biggers
2020-04-09 20:34                     ` Linus Torvalds
2020-04-09 23:34                       ` Stephen Rothwell
2020-04-10  1:11                       ` Theodore Y. Ts'o
2020-04-09 12:55   ` Dmitry Vyukov
2020-04-09 16:32     ` Linus Torvalds
2020-04-09 16:58       ` Qian Cai
2020-04-09 17:05         ` Linus Torvalds
2020-04-09 17:58           ` Qian Cai
2020-04-09 18:06             ` Linus Torvalds
2020-04-09 21:14               ` Qian Cai
2020-04-10 13:12                 ` Tetsuo Handa
2020-04-10 14:26                   ` Qian Cai
2020-04-10 17:26                     ` Andrew Morton
2020-04-10 19:46                       ` Qian Cai
2020-04-09 23:29       ` Stephen Rothwell
2020-04-13 22:06         ` Qian Cai
2020-04-13 23:05           ` Jens Axboe
2020-04-14 11:12           ` Dmitry Vyukov
2020-04-14 11:59             ` Qian Cai
2020-04-14 12:05               ` Dmitry Vyukov
2020-04-14 19:28             ` Dan Rue
2020-04-15 11:09               ` Dmitry Vyukov
2020-04-15 16:23                 ` Dan Rue
2020-04-16  0:34             ` Stephen Rothwell
2020-05-11 15:29               ` Dmitry Vyukov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).