linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Sasha Levin <sashal@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org,
	Dave Chinner <dchinner@redhat.com>,
	"Darrick J . Wong" <darrick.wong@oracle.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF
Date: Mon, 3 Dec 2018 10:23:03 +1100	[thread overview]
Message-ID: <20181202232302.GT19305@dastard> (raw)
In-Reply-To: <20181201074909.GC213156@sasha-vm>

On Sat, Dec 01, 2018 at 02:49:09AM -0500, Sasha Levin wrote:
> On Sat, Dec 01, 2018 at 08:50:05AM +1100, Dave Chinner wrote:
> >On Fri, Nov 30, 2018 at 05:14:41AM -0500, Sasha Levin wrote:
> >>On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
> >>>On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
> >>>>I stopped my tests at 5 billion ops yesterday (i.e. 20 billion ops
> >>>>aggregate) to focus on testing the copy_file_range() changes, but
> >>>>Darrick's tests are still ongoing and have passed 40 billion ops in
> >>>>aggregate over the past few days.
> >>>>
> >>>>The reason we are running these so long is that we've seen fsx data
> >>>>corruption failures after 12+ hours of runtime and hundreds of
> >>>>millions of ops. Hence the testing for backported fixes will need to
> >>>>replicate these test runs across multiple configurations for
> >>>>multiple days before we have any confidence that we've actually
> >>>>fixed the data corruptions and not introduced any new ones.
> >>>>
> >>>>If you pull only a small subset of the fixes, the fsx will still
> >>>>fail and we have no real way of actually verifying that there have
> >>>>been no regression introduced by the backport.  IOWs, there's a
> >>>>/massive/ amount of QA needed for ensuring that these backports work
> >>>>correctly.
> >>>>
> >>>>Right now the XFS developers don't have the time or resources
> >>>>available to validate stable backports are correct and regression
> >>>>fre because we are focussed on ensuring the upstream fixes we've
> >>>>already made (and are still writing) are solid and reliable.
> >>>
> >>>Ok, that's fine, so users of XFS should wait until the 4.20 release
> >>>before relying on it?  :)
> >>
> >>It's getting to the point that with the amount of known issues with XFS
> >>on LTS kernels it makes sense to mark it as CONFIG_BROKEN.
> >
> >Really? Where are the bug reports?
> 
> In 'git log'! You report these every time you fix something in upstream
> xfs but don't backport it to stable trees:

That is so wrong on so many levels I don't really know where to
begin. I guess doing a *basic risk analysis* demonstrating that none
of those fixes are backport candidates is a good start:

> $ git log --oneline v4.18-rc1..v4.18 fs/xfs
> d4a34e165557 xfs: properly handle free inodes in extent hint validators

Found by QA with generic/229 on a non-standard config. Not user
reported, unlikely to ever be seen by users.

> 9991274fddb9 xfs: Initialize variables in xfs_alloc_get_rec before using them

Cleaning up coverity reported issues to do with corruption log
messages. No visible symptoms, Not user reported.

> d8cb5e423789 xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation

Minor free space accounting issue, not user reported, doesn't affect
normal operation.

> e53c4b598372 xfs: ensure post-EOF zeroing happens after zeroing part of a file

Found with fsx via generic/127. Not user reported, doesn't affect
userspace operation at all.

> a3a374bf1889 xfs: fix off-by-one error in xfs_rtalloc_query_range

Regression fix for code introduced in 4.18-rc1. Not user reported
because the code has never been released.

> 232d0a24b0fc xfs: fix uninitialized field in rtbitmap fsmap backend

Coverity warning fix, not user reported, not user impact.

> 5bd88d153998 xfs: recheck reflink state after grabbing ILOCK_SHARED for a write

Fixes warning from generic/166, not user reported. Could affect
users mixing direct IO with reflink, but we expect people using
new functionality like reflink to be tracking TOT fairly closely
anyway.

> f62cb48e4319 xfs: don't allow insert-range to shift extents past the maximum offset

Found by QA w/ generic/465. Not user reported, only affects files in
the exabyte range so not a real world problem....

> aafe12cee0b1 xfs: don't trip over negative free space in xfs_reserve_blocks

Found during ENOSPC stress tests that depeleted the reserve pool.
Not user reported, unlikely to ever be hit by users.

> 10ee25268e1f xfs: allow empty transactions while frozen

Removes a spurious warning when running GETFSMAP ioctl on a frozen
filesystem. Not user reported, highly unlikely any user will ever
hit this as nothing but XFs utilities use GETFSMAP at the moment.

> e53946dbd31a xfs: xfs_iflush_abort() can be called twice on cluster writeback failure

Bug in corrupted filesystem handling, been there for ~15 years IIRC.
Not user reported - found by one of our shutdown stress tests
on a debug kernel (generic/388, IIRC). Highly unlikely to show up in
the real world given how long the bug has been there.

> 23fcb3340d03 xfs: More robust inode extent count validation

Found by filesystem image fuzzing (i.e. intentional filesystem
corruption). Not user reported, and the filesystem corruption that
triggered this problem is so artificial there is really no chance of
it ever occurring in the real world.

> e2ac836307e3 xfs: simplify xfs_bmap_punch_delalloc_range

Cleanup and simplification. Not a bug fix, not user reported, not a
backport candidate.

IOWs, there isn't a single commit in this list that is user
reported, nor anything that I'd consider a stable kernel backport
candidate because none of them affect normal user workloads. i.e.
they've all be found by tools designed to break filesystems and
exercise rarely travelled error paths.

> Since I'm assuming that at least some of them are based on actual issues
> users hit, and some of those apply to stable kernels, why would users
> want to use an XFS version which is knowingly buggy?

Your assumption is not only incorrect, it is fundamentally flawed.
A list of commits containing bug fixes is not a list of bug reports
from users.

IOWs, backporting them only increases the risk of regressions for
users, it doesn't reduce the risk of users hitting problems or fix
any problems that users are at risk of actually hitting. IOWs, all
of these changes fall on the wrong side of the risk-benefit analysis
equation.

Risk/benefit analysis is fundamental to software engineering
processes.  Running "git log" is not a risk analysis - it's just
provides a list of things that you need to perform an analysis on.
Risk analsysis takes time and effort, and to imply that it is not
necessary and we should just backport everything makes the incorrect
assumption that backporting carries no risk at all.

It seems to me that the stable kernel process measures itself on how
many commits an dhow fast they are backported from mainline kernels,
and the entire focus of improvement is on backporting /more/ commits
/faster/. i.e.  it's all about the speed and quantity of code being
moved back to the "stable" kernels. What it should be doing is
identifying and addressing bugs or flaws that put users are risk or
that users are reporting.

Further, the speed at which backports occur (i.e. within a day or 3
of upstream commit) means that the code being backported hasn't had
time to reach a wide testing audience and have regressions shaken
out of it. The whole purpose of having progressively stricter -rcX
upstream kernel releases is to allow the new code to stabilise and
shake out unforseen regressions before it gets to users. The stable
process is actually releasing upstream code to users before they can
even get it in a released upstream kernel (i.e. a .0 kernel, not a
-rcX).

IOWs, pulling code back to stable kernels before it's had a chance
to stabilise and be more widely tested in the upstream kernel is
entirely the wrong thing to be doing. Speed here does not improve
stability, it just increases the risk of regressions and unforseen
bugs being introduced into the stable tree. And that's made worse by
the fact that the -rcX process and widespread upstream testing that
goes along with it* to catch those bugs and regressions. And that's
made even worse by the fact that subsystems don't have control over
what is backported anymore, so they may not even be aware that a fix
for a fix needs to be sent back to stable kernels.

This is the issue here - the "stable kernel" criteria is not about
stability - it's being optimised to shovel as much change as
possible with /as little effort as possible/ back into older code
bases. That's not a recipe for stability, especially considering the
relative lack of QA the stable kernels get.

IMO, the whole set of linux kernel processes are being optimised
around the wrong metrics - we count new features, the number of
commits per release and the quantity of code that gets changed. We
then optimise our processes to increase these metrics. IOWs, we're
optimising for speed and rapid change, not quality, reliability and
stability.

We are not measuring code quality improvements, how effective our
code review is, we do not do post-mortem analysis of major failures
and we most certainly don't change processes to avoid those problems
in future, etc. And worst of all is that people who want better
processes to improve code quality, testing, etc get shouted at
because it may slow down the rate at which we change code. i.e. only
"speed and quantity" seems to matter to the core upstream kernel
developement community.

As Darrick said, what we are seeing here is a result of "[...] the
kernel community's systemic inability to QA new fs features
properly." I'm taking that one step further - what we are seeing
here is the kernel community's systemic inability to address
fundamental engineering process deficiencies because "speed and
quantity" are considered more important than the quality of the
product being produced.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2018-12-02 23:23 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-29  6:00 [PATCH AUTOSEL 4.14 01/35] media: omap3isp: Unregister media device as first Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 02/35] iommu/vt-d: Fix NULL pointer dereference in prq_event_thread() Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 03/35] brcmutil: really fix decoding channel info for 160 MHz bandwidth Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 04/35] iommu/ipmmu-vmsa: Fix crash on early domain free Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 05/35] can: rcar_can: Fix erroneous registration Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 06/35] test_firmware: fix error return getting clobbered Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 07/35] HID: input: Ignore battery reported by Symbol DS4308 Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 08/35] batman-adv: Use explicit tvlv padding for ELP packets Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 09/35] batman-adv: Expand merged fragment buffer for full packet Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 10/35] amd/iommu: Fix Guest Virtual APIC Log Tail Address Register Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 11/35] bnx2x: Assign unique DMAE channel number for FW DMAE transactions Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 12/35] qed: Fix PTT leak in qed_drain() Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 13/35] qed: Fix reading wrong value in loop condition Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 14/35] Revert "usb: gadget: ffs: Fix BUG when userland exits with submitted AIO transfers" Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 15/35] net/mlx4_core: Zero out lkey field in SW2HW_MPT fw command Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 16/35] net/mlx4_core: Fix uninitialized variable compilation warning Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 17/35] net/mlx4: Fix UBSAN warning of signed integer overflow Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 18/35] gpio: mockup: fix indicated direction Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 19/35] mtd: rawnand: qcom: Namespace prefix some commands Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 20/35] exec: make de_thread() freezable Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 21/35] HID: multitouch: Add pointstick support for Cirque Touchpad Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 22/35] mtd: spi-nor: Fix Cadence QSPI page fault kernel panic Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 23/35] qed: Fix bitmap_weight() check Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 24/35] qed: Fix QM getters to always return a valid pq Sasha Levin
2018-11-29  6:00 ` [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF Sasha Levin
2018-11-29 12:14   ` Dave Chinner
2018-11-29 12:47     ` Greg KH
2018-11-29 22:40       ` Dave Chinner
2018-11-30  8:22         ` Greg KH
2018-11-30 10:14           ` Sasha Levin
2018-11-30 20:35             ` Darrick J. Wong
2018-11-30 21:50             ` Dave Chinner
2018-12-01  7:49               ` Sasha Levin
2018-12-01  9:09                 ` XFS patches for stable Amir Goldstein
2018-12-02 15:25                   ` Sasha Levin
2018-12-02 16:10                     ` Christoph Hellwig
2018-12-02 20:08                       ` Greg KH
2018-12-03 14:41                         ` Richard Weinberger
2018-12-03 16:56                           ` Sasha Levin
2018-12-02 23:23                 ` Dave Chinner [this message]
2018-12-03  7:11                   ` [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF Amir Goldstein
2018-12-03  9:22                   ` Sasha Levin
2018-12-03 21:23                     ` Thomas Backlund
2018-12-04  7:28                       ` Greg KH
2018-12-04  8:12                       ` Sasha Levin
2018-12-28  8:06                       ` Pavel Machek
2018-12-29 23:35                         ` Dave Chinner
2018-11-30 21:45           ` Dave Chinner
2018-12-02 20:11             ` Greg KH
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 26/35] net: faraday: ftmac100: remove netif_running(netdev) check before disabling interrupts Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 27/35] iommu/vt-d: Use memunmap to free memremap Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 28/35] flexfiles: use per-mirror specified stateid for IO Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 29/35] net: thunderx: set xdp_prog to NULL if bpf_prog_add fails Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 30/35] ibmvnic: Fix RX queue buffer cleanup Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 31/35] virtio-net: disable guest csum during XDP set Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 32/35] virtio-net: fail XDP set if guest csum is negotiated Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 33/35] team: no need to do team_notify_peers or team_mcast_rejoin when disabling port Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 34/35] net: amd: add missing of_node_put() Sasha Levin
2018-11-29  6:01 ` [PATCH AUTOSEL 4.14 35/35] net: thunderx: set tso_hdrs pointer to NULL in nicvf_free_snd_queue Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181202232302.GT19305@dastard \
    --to=david@fromorbit.com \
    --cc=darrick.wong@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).