Re: [syzbot] WARNING in do_mkdirat

From: "Theodore Ts'o" <tytso@mit.edu>
To: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	syzbot <syzbot+919c5a9be8433b8bf201@syzkaller.appspotmail.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	syzkaller-bugs@googlegroups.com
Subject: Re: [syzbot] WARNING in do_mkdirat
Date: Mon, 12 Dec 2022 13:58:51 -0500	[thread overview]
Message-ID: <Y5d565XVsinbNNL2@mit.edu> (raw)
In-Reply-To: <20221212032911.2965-1-hdanton@sina.com>

On Mon, Dec 12, 2022 at 11:29:11AM +0800, Hillf Danton wrote:
> > You've completely misunderstood Al's point.  He's not whining about
> > being cc'd, he's pointing at this is ONLY USEFUL IF THE NTFS3
> > MAINTAINERS ARE CC'd.  And they're not.  So this is just noise.
> > And enough noise means that signal is lost.
> 
> Call Trace:
>  <TASK>
>  inode_unlock include/linux/fs.h:761 [inline]
>  done_path_create fs/namei.c:3857 [inline]
>  do_mkdirat+0x2de/0x550 fs/namei.c:4064
>  __do_sys_mkdirat fs/namei.c:4076 [inline]
>  __se_sys_mkdirat fs/namei.c:4074 [inline]
>  __x64_sys_mkdirat+0x85/0x90 fs/namei.c:4074
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> Given the call trace above, how do you know the ntfs3 guys should be also
> Cced in addition to AV? What if it would take more than three months for
> syzbot to learn the skills in your mind? What is preventing you routing
> the report to ntfs3?

If it takes 3 months for syzbot to take a look at the source code in
their own #!@?! reproducer, or just to take a look at the strace link
in the dashboard:

[pid  3639] mount("/dev/loop0", "./file2", "ntfs3", MS_NOSUID|MS_NOEXEC|MS_DIRSYNC|MS_I_VERSION, "") = 0

There's something really wrong.  The point Al has been making (and
I've been making for multiple years) is that Syzbot has the
information, but unfortunately, at the moment, it is only analyzing
the the stack trace, and it is not doing things that really could be
done automatically --- and cloud VM time is cheap, and upstream
maintainer time is expensive.  So by not improving syzbot in a way
that really shouldn't be all that difficult, the syzbot maintainers is
disrespectiving the time of the upstream maintainers.

So sure, we could ask Linus to triage all syzbot reports --- or we
could ask Al to triage all syzbot file system reports --- but that is
not a good use of upstream resources.

And "we didn't know this is super annoying" isn't an excuse, because
I've been asking for things like this *before* the COVID pandemic.  So
if the Syzbot team won't listen to observations by a random Google
engineer who happens to be an ext4 maintainer (or rather, I'm sure
they were listening, but they didn't consider it important enough to
staff and put on the roadmap), maybe something a bit
more.... assertive by Al is something that will inspire them to
prioritize this feature request "above the fold".  :-)

And Al does have a point --- if a lot of upstream maintainers consider
Syzbot reports to be less than useful, they will either auto-file
reports to a junk folder, or just ignore the Syzbot reports because
they are busy and the Probability(Usefulness) is close to zero, then
recovering from that black eye to Syzbot's reputation is going to be a
lot more difficult than if Syzbot was made more respectful of upstream
maintainer time much earlier.

Now, to be fair to the Syzbot team, the Syzbot console has gotten much
better.  You can now download the syzbot trace, and download the
mounted file system, when before, you had to do a lot more work to
extract the file system (which is stored in separate constant C
array's as compressed data) from the C reproducer.  So have things
have gotten better.

But at the same time, characterizing a syzbot report is something to
be done by every file system maintainer who looks as a syzbot report,
because there is no way to add a tag to the syzbot report that this
particular syzbot report *really* is an ntfs3 issue.  So any
information that a single developer figures out when triaging a bug
(is this potentially an ext4 bug, nope, it's an ntfs3 bug) has to be
replicated by every single kernel developer looking at the Syzbot
dashboard.  Which again, is not respectful of upstream maintainers'
time.

						- Ted