* [Bug 201685] New: ext4 file system corruption
@ 2018-11-13 19:42 bugzilla-daemon
  2018-11-14 21:20 ` [Bug 201685] " bugzilla-daemon
                   ` (270 more replies)
  0 siblings, 271 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-13 19:42 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

            Bug ID: 201685
           Summary: ext4 file system corruption
           Product: File System
           Version: 2.5
    Kernel Version: maybe one of 4.18.18 4.19.1 4.20-rc2
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: ext4
          Assignee: fs_ext4@kernel-bugs.osdl.org
          Reporter: claude@mathr.co.uk
        Regression: No

Created attachment 279431
  --> https://bugzilla.kernel.org/attachment.cgi?id=279431&action=edit
dmesg 4.18.18 amdgpu.dc=0

My system was fine when I shut it down on Sunday Nov 11.  Today Nov 13 I booted
4.19.1, built two new kernels 4.20-rc2 4.18.18 (using a tmpfs, not the SSD or
HDD), then booted into those kernels briefly (to test if a different bug had
been fixed).  Finally I booted into 4.18.18 (setting amdgpu.dc=0 to work around
my other bug), and after some moments experienced symptoms of filesystem
corruption on opening an xterm:

    sed: error while loading shared libraries:
/lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107
    sed: error while loading shared libraries:
/lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107
    claude@eiskaffee:~$ 

I fixed it by extracting the relevant file from the Debian archive on a
different machine and using `cat` with `bash` shell IO redirection to overwrite
the corrupted shared library file on my problem machine.
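
A repair like this can be double-checked by comparing the restored file with
the known-good copy.  The following is only a minimal Python sketch of that
idea, not something the reporter ran; the helper names `sha256_of` and
`same_content` are hypothetical, and any file paths passed in are up to the
reader.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(suspect, reference):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(suspect) == sha256_of(reference)
```

Comparing digests rather than sizes or timestamps matters here, because a
corrupted data block typically leaves the file size unchanged.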

Here are the relevant versions extracted from my syslog:

    Nov 13 15:45:49 eiskaffee kernel: [    0.000000] Linux version 4.19.1
(claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 6
14:58:04 GMT 2018
    Nov 13 18:44:12 eiskaffee kernel: [    0.000000] Linux version 4.20.0-rc2
(claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 13
16:38:55 GMT 2018
    Nov 13 18:45:00 eiskaffee kernel: [    0.000000] Linux version 4.18.18
(claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 13
16:23:11 GMT 2018
    Nov 13 18:46:13 eiskaffee kernel: [    0.000000] Linux version 4.18.18
(claude@eiskaffee) (gcc version 8.2.0 (Debian 8.2.0-9)) #1 SMP Tue Nov 13
16:23:11 GMT 2018

mount says:

    /dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)


The machine in question is my production workstation, so I don't feel like
testing anything that might result in data loss.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
@ 2018-11-14 21:20 ` bugzilla-daemon
  2018-11-15  4:37 ` bugzilla-daemon
                   ` (269 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-14 21:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Jason Gambrel (jaygambrel@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jaygambrel@gmail.com

--- Comment #1 from Jason Gambrel (jaygambrel@gmail.com) ---
Similar problems on Linux Mint 19 Tara using kernels 4.19.0 and 4.19.1

On rebooting my ASUS UX 430U laptop I ended up at the initramfs prompt and had
to run fsck to repair the root file system.  I was then able to continue
booting.  Sorry, I didn't save the log files.  This has happened randomly
twice.  I will post the logs if this happens again.

Did not happen under 4.18


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
  2018-11-14 21:20 ` [Bug 201685] " bugzilla-daemon
@ 2018-11-15  4:37 ` bugzilla-daemon
  2018-11-15 16:19 ` bugzilla-daemon
                   ` (268 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-15  4:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Theodore Tso (tytso@mit.edu) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@mit.edu

--- Comment #2 from Theodore Tso (tytso@mit.edu) ---
I'm using a 4.19.0 based kernel (with some ext4 patches for the 4.20 mainline)
and I'm not noticing any file system problems.   I'm running a Dell XPS 13 with
an NVME SSD, and Debian testing as my userspace.

It's hard to do anything with a "my file system is corrupted" report without
any kind of reliable reproduction information.   Remember that file system
corruptions can be caused by any number of things --- buggy device drivers,
buggy Nvidia binary modules that dereference wild pointers and randomly corrupt
kernel memory, RAID code if you are using RAID, etc., etc.

Also, the symptoms reported by Claude and Jason are very different.  Claude has
reported that a data block in a shared library file has gotten corrupted. 
Jason has reported file system metadata corruption.   This could very well
be coming from different root causes.

So it's better with these sorts of things to file separate bugs, and to include
detailed hardware configuration details, kernel configuration, dumpe2fs outputs
of the file system in question, as well as e2fsck logs.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
  2018-11-14 21:20 ` [Bug 201685] " bugzilla-daemon
  2018-11-15  4:37 ` bugzilla-daemon
@ 2018-11-15 16:19 ` bugzilla-daemon
  2018-11-15 16:43 ` bugzilla-daemon
                   ` (267 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-15 16:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #3 from Jason Gambrel (jaygambrel@gmail.com) ---
(In reply to Theodore Tso from comment #2)
> I'm using a 4.19.0 based kernel (with some ext4 patches for the 4.20
> mainline) and I'm not noticing any file system problems.   I'm running a
> Dell XPS 13 with an NVME SSD, and Debian testing as my userspace.
> 
> It's hard to do anything with a "my file system is corrupted" report without
> any kind of reliable reproduction information.   Remember that file system
> corruptions can be caused by any number of things --- buggy device drivers,
> buggy Nvidia binary modules that dereference wild pointers and randomly
> corrupt kernel memory, RAID code if you are using RAID, etc., etc.
> 
> Also, the symptoms reported by Claude and Jason are very different.  Claude
> has reported that a data block in a shared library file has gotten
> corrupted.  Jason has reported file system metadata corruption.   This
> could very well be coming from different root causes.
> 
> So it's better with these sorts of things to file separate bugs, and to
> include detailed hardware configuration details, kernel configuration,
> dumpe2fs outputs of the file system in question, as well as e2fsck logs.

Thank you for your reply Theodore and I apologize for my unhelpful post.  I am
relatively new to this space so I find your advice very helpful.  If the file
system corruption happens a 3rd time (hopefully it won't), I will post a
separate bug report.  I also wasn't previously aware of dumpe2fs, so I will
provide that helpful information next time. I have also searched to find any
additional logs and it looks like fsck logs the boot info under
/var/log/boot.log and potentially /var/log/syslog.  Unfortunately the
information from my last boot that required repair had already been
overwritten.  I will keep this in mind for the future.

If it helps, my system uses an i7 with integrated Intel graphics.  I am not
running any proprietary drivers.  No RAID.  500 GB SSD.  16 GB RAM with a 4 GB
swap file (not a swap partition).  I have been using ukuu to install mainline
kernels.  I did not change anything else on my system.  When I jumped from
4.18.17 to 4.19.0 this problem first appeared.  Then it occurred again after
updating to 4.19.1.

I'm uncertain as to whether it would be helpful or not, but while trying to
figure out why this happened to me, I came across a post on Ask Ubuntu with a
few others reporting similar problems.  They did provide some debugging
information in their post at:
https://askubuntu.com/questions/1092558/ubuntu-18-04-4-19-1-kernel-after-closing-the-lid-for-the-night-not-logging-ou

Again it might be a different problem from what Claude Heiland-Allen is
experiencing.

Thank you very much for your advice and I will try and provide some useful
information including logs in a separate bug report if it happens again.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (2 preceding siblings ...)
  2018-11-15 16:19 ` bugzilla-daemon
@ 2018-11-15 16:43 ` bugzilla-daemon
  2018-11-16 15:09 ` bugzilla-daemon
                   ` (266 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-15 16:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #4 from Theodore Tso (tytso@mit.edu) ---
Thanks for pointing out that bug.   I'll note that the poster who
authoritatively claimed that 4.19 is safe, and that the bug obviously was
introduced in 4.19.1, didn't bother to do a "git log --stat v4.19..v4.19.1".  
This would show that the changes were all in the Sparc architecture support,
networking drivers, the networking stack, and a one-line change in the crypto
subsystem....

This is why I always tell users to report symptoms, not diagnosis.   And for
sure, not to bias their observations by their certainty that they have
diagnosed the problem.    (If they think they have diagnosed the problem, send
me a patch, preferably with a reliable repro so we can add a regression test. 
:-)
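
The check described above can be made a little more systematic by counting
which directories a stable update touches.  What follows is a rough Python
sketch under stated assumptions: `changed_paths` and `subsystem_counts` are
hypothetical helper names, the repository path is a placeholder for a local
clone of the kernel tree, and the sample paths are invented for illustration
and are not the actual contents of 4.19.1.

```python
import subprocess
from collections import Counter


def changed_paths(rev_range, repo="."):
    """Return the paths touched in rev_range, one per line, as raw text.

    Assumes `repo` is a local git clone that contains both tags.
    """
    result = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:", rev_range],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def subsystem_counts(name_only_output, depth=2):
    """Count touched files per leading directory (at most `depth` components)."""
    counts = Counter()
    for line in name_only_output.splitlines():
        path = line.strip()
        if not path:
            continue  # blank separator between commits
        dirs = path.split("/")[:-1]  # drop the file name itself
        key = "/".join(dirs[:depth]) or "(top level)"
        counts[key] += 1
    return counts


# Hypothetical sample input; real input would come from something like
# changed_paths("v4.19..v4.19.1", repo="/path/to/linux").
sample = """\
arch/sparc/kernel/example_a.c

net/core/example_b.c
net/core/example_c.c
crypto/example_d.c
"""

for subsystem, count in subsystem_counts(sample).most_common():
    print(f"{count:4d}  {subsystem}")
```

If nothing under fs/ or block/ shows up in such a summary, that is a strong
hint that the release in question did not introduce a file system regression.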


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (3 preceding siblings ...)
  2018-11-15 16:43 ` bugzilla-daemon
@ 2018-11-16 15:09 ` bugzilla-daemon
  2018-11-16 19:03 ` bugzilla-daemon
                   ` (265 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-16 15:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

carlphilippreh@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carlphilippreh@gmail.com

--- Comment #5 from carlphilippreh@gmail.com ---
Two of my Linux machines have experienced regular ext4 file system corruption
since I updated them to 4.19. 4.18 was fine. I first noticed the problem when
certain file operations returned "Structure needs cleaning". fsck then mostly
finds dangling inodes (of files I have written recently), incorrect reference
counts, and so on. Neither machine uses RAID or any proprietary drivers, and
both have an Intel board. One of them uses an SSD and the other an HDD.
Unfortunately, I don't know which information might be useful to you.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (4 preceding siblings ...)
  2018-11-16 15:09 ` bugzilla-daemon
@ 2018-11-16 19:03 ` bugzilla-daemon
  2018-11-20  5:57 ` bugzilla-daemon
                   ` (264 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-16 19:03 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #6 from Theodore Tso (tytso@mit.edu) ---
Please send detailed information about your hardware (lspci -v and dmesg while
it is booting would be helpful).   Also please send the results of running
dumpe2fs on the file system, and the kernel logs when file system operations
started returning "Structure needs cleaning".   I want to see if there are any
other kernel messages in and around the ext4 error messages that will be in the
kernel logs.    Also please send me the fsck logs, and what sort of workload
(what programs) you have running on your system.   Also, do you do anything
unusual on your machine; do you typically do clean shutdowns, or do you just do
forced power-offs?   Are you regularly running into a large amount of memory
pressure (e.g., are you regularly using a large percentage of the physical
memory available on your system.)
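
For reporters who want to gather the requested output in one go, a small
script along the following lines might help.  This is only a sketch, not an
official tool: the function names, the output file names, and the example
device path are all assumptions made for illustration, and dumpe2fs and dmesg
usually need to be run as root.

```python
import subprocess
from pathlib import Path


def diagnostic_commands(device):
    """Map output file names (arbitrary) to the commands requested above."""
    return {
        "lspci.txt": ["lspci", "-v"],
        "dmesg.txt": ["dmesg"],
        "dumpe2fs.txt": ["dumpe2fs", device],
    }


def collect(commands, out_dir):
    """Run each command and save stdout and stderr to out_dir/<file name>.

    Returns a dict of file name -> exit status, or None if the command
    could not be started at all (for example, because it is not installed).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    statuses = {}
    for filename, argv in commands.items():
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, errors="replace"
            )
            output = proc.stdout + proc.stderr
            statuses[filename] = proc.returncode
        except OSError as exc:
            output = f"could not run {argv[0]}: {exc}\n"
            statuses[filename] = None
        (out_dir / filename).write_text(output)
    return statuses


if __name__ == "__main__":
    # /dev/sdXN is a placeholder; substitute the affected file system.
    print(collect(diagnostic_commands("/dev/sdXN"), "ext4-report"))
```

The fsck logs and the kernel messages from the time of the errors still have
to be saved separately, since they are not available from a later boot.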

This is going to end up being a process of elimination.   4.19 works for me.  
I'm using a 2018 Dell XPS 13, Model 9370, with 16GB of memory and I run a
typical kernel developer workload.   We also run a large number of ext4
regression tests, which generally happens on KVM for one developer, and I use
Google Compute Engine for my tests.   None of this detected any problems before
4.19 was released.    So the question then is --- what makes the systems of
people who are experiencing difficulties different from my development laptop
(which also has an Intel board, and an SSD connected using NVMe)?  This is why
getting lots of details about the precise hardware configuration is going to be
critically important.

In the ideal world we would come up with a clean, simple, reliable reproducer. 
Then we can experiment and see if the reliable reproducer continues to
reproduce on different hardware, etc.   

Finally, since in order to figure things out we may need a lot of detail about
the hardware, the software, and the applications running on each of the systems
where people are seeing problems, it's helpful if new people upload all of this
information onto new kernel bugzilla issues, and then mention the kernel
bugzilla issue here, so people can follow the links.

I'll note that a few years ago, we had a mysterious "ext4 failure" that
ultimately turned out to be an Intel virtualization hardware bug, and it was the
*host* kernel version that mattered, not the *guest* kernel version.  
Worse, it was fixed in the very next version of the kernel, and so it was only
people using Debian host kernels that ran into troubles --- but **only** if
they were using a specific Intel chipset and Intel CPU generation.   Everyone
kept on swearing up and down it was an ext4 bug, and there were many angry
people arguing this on bugzilla.   Ultimately, it was a problem caused by a
hardware bug, and a kernel workaround that was in 3.18 but not in 3.17, and
Debian hadn't noticed they needed to backport the kernel workaround....   And
because everyone was *certain* that the host kernel version didn't matter ---
after all, it was *obviously* an ext4 bug in the guest kernel --- they didn't
report it, and that made figuring out what the problem was (it took over a
year) much, Much, MUCH harder.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (5 preceding siblings ...)
  2018-11-16 19:03 ` bugzilla-daemon
@ 2018-11-20  5:57 ` bugzilla-daemon
  2018-11-21  0:38 ` bugzilla-daemon
                   ` (263 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-20  5:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

m@maltris.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |m@maltris.org

--- Comment #7 from m@maltris.org ---
Confirming for now, similar problems with 4.19.1, 4.19.2, 4.20-rc1, 4.20-rc2
and 4.20-rc3 from the Ubuntu mainline-kernel repository
(http://kernel.ubuntu.com/~kernel-ppa/mainline/).


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (6 preceding siblings ...)
  2018-11-20  5:57 ` bugzilla-daemon
@ 2018-11-21  0:38 ` bugzilla-daemon
  2018-11-21  0:41 ` bugzilla-daemon
                   ` (262 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21  0:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Jimmy.Jazz@gmx.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Jimmy.Jazz@gmx.net

--- Comment #8 from Jimmy.Jazz@gmx.net ---
I could reproduce the issue. Probably not relevant, but I had to modify the
initramfs script: for some reason the 4.19.0 kernel changed the mdp raid major
number from 245 to 9 (i.e. on a devtmpfs filesystem) and renamed the devices
/dev/mdX instead of /dev/md_dX as before. Partitions are now major 259. The
md/lvm devices also became faster, by the way. I tried ext4 mmp protection, but
without success (i.e. it is not a multi-mount issue). The 4.20.0-rc3 kernel
gives the same sort of issue. It is reproducible on another amd64 machine with
the same configuration but different hardware.

Applications used for the test,
sys-fs/lvm2-2.02.173, sys-fs/mdadm-4.1, sys-fs/e2fsprogs-1.44.4,
sys-apps/coreutils-8.30, sys-apps/util-linux-2.33

More info in ext4_iget.txt.xz


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (7 preceding siblings ...)
  2018-11-21  0:38 ` bugzilla-daemon
@ 2018-11-21  0:41 ` bugzilla-daemon
  2018-11-21 14:48 ` bugzilla-daemon
                   ` (261 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21  0:41 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #9 from Jimmy.Jazz@gmx.net ---
Created attachment 279557
  --> https://bugzilla.kernel.org/attachment.cgi?id=279557&action=edit
more infos (lspci, dmesg, etc.)

compressed text file


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (8 preceding siblings ...)
  2018-11-21  0:41 ` bugzilla-daemon
@ 2018-11-21 14:48 ` bugzilla-daemon
  2018-11-21 16:12 ` bugzilla-daemon
                   ` (260 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 14:48 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #10 from Jimmy.Jazz@gmx.net ---
(In reply to Jimmy.Jazz from comment #8)

typo: read -rc2 not -rc3

I'm using rc3 release now. /etc/mke2fs.conf default_mntopts user_xattr is
deactivated (tune2fs -o ^user_xattr /dev/mapper/xx) on all my lvm devices.
Native mdp devices still have the option set.
One of my machines is always under heavy load because of the daily compilations
I run in a nilfs sandbox environment. No error for the moment.

The issue appeared all of a sudden and affected all my lvm devices. On the next
reboot, fsck sometimes couldn't "see" the failures the kernel had reported, but
detection improved when it was run with the -D optimization option.

It could be some old corruption that went undetected until now. One of the
servers is more than 5 years old without a reinstall but still with regular
updates. But the other one has the issue on its first installation.

The kernel was compiled with GCC and LD=ld.bfd
I was unsuccessful with CLANG.

# gcc --version
gcc (Gentoo 8.2.0-r4 p1.5) 8.2.0

I'm a bit puzzled.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (9 preceding siblings ...)
  2018-11-21 14:48 ` bugzilla-daemon
@ 2018-11-21 16:12 ` bugzilla-daemon
  2018-11-21 17:26 ` bugzilla-daemon
                   ` (259 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 16:12 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #11 from Theodore Tso (tytso@mit.edu) ---
Thanks Jimmy for your report.  Can you specify what sort of LVM devices you are
using?   Is it just a standard LVM volume (e.g., no LVM raid, no LVM snapshots,
no dm-thin provisioning)?

The reason why I ask is because I've run gce-xfstests on 4.19, 4.19.1, and
4.19.2, and it uses LVM (nothing fancy just standard LVM volumes, although
xfstests will layer some dm-error and dm-thin on top of the LVM volumes for
specific xfstests) on top of virtio-scsi on top of Google Compute Engine's
Persistent Disks, and I'm not noticing any problems.

I just noticed that my .config file for my GCE testing has
CONFIG_SCSI_MQ_DEFAULT set to "no", which means I'm not using the new block-mq
data path.   So perhaps this is a MQ specific bug?   (Checking... hmm, my
laptop running 4.19.0 plus the ext4 commits landing in 4.20-rc2+ is *also*
using CONFIG_SCSI_MQ_DEFAULT=n.)   And Kconfig recommends that SCSI_MQ_DEFAULT
be defaulted to y.
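
For anyone comparing their own setup, a setting like this can be read straight
out of a kernel config file.  The snippet below is a minimal Python sketch;
`kconfig_value` is a hypothetical helper, the sample text is illustrative, and
where the config lives (for example /boot/config-<version>, or /proc/config.gz
on kernels built with that option) depends on the distribution and build.

```python
def kconfig_value(config_text, option):
    """Look up one option in kernel .config text.

    Returns the value after '=' (such as 'y', 'm', or a quoted string),
    'n' for an explicit '# OPTION is not set' line, or None if the
    option does not appear at all.
    """
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Illustrative sample, not taken from any particular machine.
sample_config = """\
CONFIG_SCSI=y
CONFIG_SCSI_MQ_DEFAULT=y
# CONFIG_DM_MQ_DEFAULT is not set
"""

for name in ("CONFIG_SCSI_MQ_DEFAULT", "CONFIG_DM_MQ_DEFAULT"):
    print(name, "=", kconfig_value(sample_config, name))
```

Matching on the full option name followed by "=" avoids confusing an option
with another one that merely shares its prefix.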

This is why having people include their kernel configs, and what devices they
use, is so important.   The vast majority of the time, given the constant
testing which we do in the ext4 layer, the problem is somewhere *else* in the
storage stack.   There have been bugs which have escaped notice by our tests,
yes.  But it's rare, and it's almost never the case when a large number of
users are reporting the same problem.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (10 preceding siblings ...)
  2018-11-21 16:12 ` bugzilla-daemon
@ 2018-11-21 17:26 ` bugzilla-daemon
  2018-11-21 18:17 ` bugzilla-daemon
                   ` (258 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 17:26 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Claude Heiland-Allen (claude@mathr.co.uk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #279431|0                           |1
        is obsolete|                            |

--- Comment #12 from Claude Heiland-Allen (claude@mathr.co.uk) ---
Created attachment 279569
  --> https://bugzilla.kernel.org/attachment.cgi?id=279569&action=edit
eiskaffee - logs and other info

I replaced my single attached dmesg with further information (as requested) in
a tarball.  It contains kernel configs, dmesg logs, tune2fs -l, lspci -vvv.  I
don't use LVM or MD on my machine (eiskaffee).  The file system corruption I
experienced was on the root partition on SSD.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (11 preceding siblings ...)
  2018-11-21 17:26 ` bugzilla-daemon
@ 2018-11-21 18:17 ` bugzilla-daemon
  2018-11-21 18:19 ` bugzilla-daemon
                   ` (257 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 18:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #13 from Jimmy.Jazz@gmx.net ---
(In reply to Theodore Tso from comment #11)
> Thanks Jimmy for your report.  Can you specify what sort of LVM devices are
> you using?  

Standard lvm linear volumes on top of a full /dev/md0p5 pv partition
on both machines with respectively kernel 4.19.2 and 4.20.0-rc3

> 
> I just noticed that my .config file for my GCE testing has
> CONFIG_SCSI_MQ_DEFAULT set to "no", which means I'm not using the new
> block-mq data path.   So perhaps this is a MQ specific bug?   (Checking...
> hmm, my laptop running 4.19.0 plus the ext4 commits landing in 4.20-rc2+ is
> *also* using CONFIG_SCSI_MQ_DEFAULT=n.)   And Kconfig recommends that
> SCSI_MQ_DEFAULT be defaulted to y.

CONFIG_SCSI_MQ_DEFAULT=y on both machines
CONFIG_DM_MQ_DEFAULT is not set

> This is why having people include their Kernel configs, and what devices
> they use is so important.

sorry it was an oversight. See attachments.

server with kernel 4.19.2
# smartctl -i /dev/sdb                                                 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.2-radeon] (local build)

=== START OF INFORMATION SECTION ===
Device Model:     MKNSSDRE512GB
Serial Number:    MK15090210005157A
LU WWN Device Id: 5 888914 10005157a
Firmware Version: N1007C
User Capacity:    512 110 190 592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 21 19:02:07 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

# smartctl -i /dev/sda                                                 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.2-radeon] (local build)

=== START OF INFORMATION SECTION ===
Device Model:     MKNSSDRE512GB
Serial Number:    MK150902100051556
LU WWN Device Id: 5 888914 100051556
Firmware Version: N1007C
User Capacity:    512 110 190 592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 21 19:04:01 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

server with 4.20.0-rc3

# smartctl -i /dev/sda                                                
=== START OF INFORMATION SECTION ===
Model Family:     HGST Travelstar 7K1000
Device Model:     HGST HTS721010A9E630
Serial Number:    JR10004M0BD4YF
LU WWN Device Id: 5 000cca 8a8c52dba
Firmware Version: JB0OA3J0
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 21 19:02:21 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

I do periodic backups, and the strange thing is that dm-7 is only accessed
read-only and it just triggered ext4_iget failures.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (12 preceding siblings ...)
  2018-11-21 18:17 ` bugzilla-daemon
@ 2018-11-21 18:19 ` bugzilla-daemon
  2018-11-21 18:21 ` bugzilla-daemon
                   ` (256 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 18:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #14 from Jimmy.Jazz@gmx.net ---
Created attachment 279571
  --> https://bugzilla.kernel.org/attachment.cgi?id=279571&action=edit
dm-7 device w/ bad extra_isize errors


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (13 preceding siblings ...)
  2018-11-21 18:19 ` bugzilla-daemon
@ 2018-11-21 18:21 ` bugzilla-daemon
  2018-11-21 18:25 ` bugzilla-daemon
                   ` (255 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 18:21 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #15 from Jimmy.Jazz@gmx.net ---
Created attachment 279573
  --> https://bugzilla.kernel.org/attachment.cgi?id=279573&action=edit
dmesg w/ EXT4-fs error only


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (14 preceding siblings ...)
  2018-11-21 18:21 ` bugzilla-daemon
@ 2018-11-21 18:25 ` bugzilla-daemon
  2018-11-21 18:28 ` bugzilla-daemon
                   ` (254 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 18:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #16 from Jimmy.Jazz@gmx.net ---
Created attachment 279575
  --> https://bugzilla.kernel.org/attachment.cgi?id=279575&action=edit
.config of 4.19.2


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (15 preceding siblings ...)
  2018-11-21 18:25 ` bugzilla-daemon
@ 2018-11-21 18:28 ` bugzilla-daemon
  2018-11-21 18:44 ` bugzilla-daemon
                   ` (253 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 18:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #17 from Jimmy.Jazz@gmx.net ---
Created attachment 279577
  --> https://bugzilla.kernel.org/attachment.cgi?id=279577&action=edit
.config of 4.20.0-rc3


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (16 preceding siblings ...)
  2018-11-21 18:28 ` bugzilla-daemon
@ 2018-11-21 18:44 ` bugzilla-daemon
  2018-11-21 18:50 ` bugzilla-daemon
                   ` (252 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 18:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #18 from Jimmy.Jazz@gmx.net ---
Kernel 4.19.2.

md0 (formerly md_d0) is a RAID1, GPT-partitioned, bootable device:

md0 : active raid1 sda[0] sdb[1]
      499976512 blocks super 1.2 [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md0p1 is the grub boot partition

# fdisk -l /dev/md0
Disk /dev/md0: 476.8 GiB, 511975948288 bytes, 999953024 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 9A6C46CD-3B9C-4C64-AE3C-EDB416548134

Device         Start       End   Sectors   Size Type
/dev/md0p1        40      2088      2049     1M BIOS boot
/dev/md0p2      2096    264240    262145   128M Linux filesystem
/dev/md0p3    264248   2361400   2097153     1G Linux filesystem
/dev/md0p4   2361408   6555712   4194305     2G Linux filesystem
/dev/md0p5   6555720 999952984 993397265 473.7G Linux filesystem

# fdisk -l /dev/sdb
Disk /dev/sdb: 477 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: MKNSSDRE512GB
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x58e9a5ac

Device     Boot Start        End    Sectors  Size Id Type
/dev/sdb1           8 1000215215 1000215208  477G fd Linux raid autodetect

The same applies to the second computer: same configuration, but with one HGST
Travelstar 7K1000 disk attached, and its kernel cmdline has mdraid=forced.

If I missed something just let me know.

Thx for your help.
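
As a cross-check on the sizes listed above, the md0 block count from the mdstat
output and the sector count from fdisk should describe the same device size.
This is only a sketch: it assumes mdstat counts 1 KiB blocks and fdisk counts
512-byte sectors, and the variable names are purely illustrative.

```shell
# mdstat block count (1 KiB units, assumed) converted to bytes.
mdstat_bytes=$((499976512 * 1024))
# fdisk sector count (512-byte sectors) converted to bytes.
fdisk_bytes=$((999953024 * 512))
# Both values should match the byte count fdisk prints for /dev/md0.
echo "$mdstat_bytes $fdisk_bytes"   # prints "511975948288 511975948288"
```

Both figures agree with the 511975948288 bytes that fdisk reports for /dev/md0.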


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (17 preceding siblings ...)
  2018-11-21 18:44 ` bugzilla-daemon
@ 2018-11-21 18:50 ` bugzilla-daemon
  2018-11-21 20:15 ` bugzilla-daemon
                   ` (251 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 18:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #19 from Jimmy.Jazz@gmx.net ---
(In reply to Theodore Tso from comment #11)

So perhaps this is a MQ specific bug?  

I checked my old .config files, and I have had CONFIG_SCSI_MQ_DEFAULT=y enabled
since version 4.1.6.

So investigating MQ will probably lead us to a dead end.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (18 preceding siblings ...)
  2018-11-21 18:50 ` bugzilla-daemon
@ 2018-11-21 20:15 ` bugzilla-daemon
  2018-11-21 20:36 ` bugzilla-daemon
                   ` (250 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 20:15 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #20 from Theodore Tso (tytso@mit.edu) ---
Can someone try 4.19.3?   I was working with another Ubuntu user who did *not*
see the problem with 4.19.0, but did see it with 4.19.1. One of the
differences in his config was:

-# CONFIG_SCSI_MQ_DEFAULT is not set
+CONFIG_SCSI_MQ_DEFAULT=y

Furthermore, he tried 4.19.3 and after two hours of heavy I/O, he's no longer
seeing problems.   Based on the above observation, his theory is this commit
may have fixed things, and it *is* blk-mq specific:

commit 410306a0f2baa5d68970cdcf6763d79c16df5f23
Author: Ming Lei <ming.lei@redhat.com>
Date:   Wed Nov 14 16:25:51 2018 +0800

    SCSI: fix queue cleanup race before queue initialization is done

    commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

    c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
    already fixed this race, however the implied synchronize_rcu()
    in blk_mq_quiesce_queue() can slow down LUN probe a lot, so caused
    performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside
    blk_cleanup_queue()")
    tried to quiesce queue for avoiding unnecessary synchronize_rcu()
    only when queue initialization is done, because it is usual to see
    lots of inexistent LUNs which need to be probed.

    However, turns out it isn't safe to quiesce queue only when queue
    initialization is done. Because when one SCSI command is completed,
    the user of sending command can be waken up immediately, then the
    scsi device may be removed, meantime the run queue in scsi_end_request()
    is still in-progress, so kernel panic can be caused.

    In Red Hat QE lab, there are several reports about this kind of kernel
    panic triggered during kernel booting.

    This patch tries to address the issue by grabing one queue usage
    counter during freeing one request and the following run queue.

This commit just landed in mainline and is not in 4.20-rc2, so the theory that
it was a blk-mq bug that was fixed by the above commit is consistent with all
of the observations made to date.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (19 preceding siblings ...)
  2018-11-21 20:15 ` bugzilla-daemon
@ 2018-11-21 20:36 ` bugzilla-daemon
  2018-11-21 22:06 ` bugzilla-daemon
                   ` (249 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 20:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #21 from Claude Heiland-Allen (claude@mathr.co.uk) ---
My kernels have this:

    4.18.19.config:CONFIG_SCSI_MQ_DEFAULT=y
    4.19.2.config:CONFIG_SCSI_MQ_DEFAULT=y
    4.20-rc3.config:CONFIG_SCSI_MQ_DEFAULT=y


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (20 preceding siblings ...)
  2018-11-21 20:36 ` bugzilla-daemon
@ 2018-11-21 22:06 ` bugzilla-daemon
  2018-11-21 22:59 ` bugzilla-daemon
                   ` (248 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 22:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #22 from Jimmy.Jazz@gmx.net ---
I will ... if I can reboot safely. This time it affects / (i.e. /dev/md0p3).
What a nightmare.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (21 preceding siblings ...)
  2018-11-21 22:06 ` bugzilla-daemon
@ 2018-11-21 22:59 ` bugzilla-daemon
  2018-11-22  1:47 ` bugzilla-daemon
                   ` (247 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-21 22:59 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #23 from Theodore Tso (tytso@mit.edu) ---
Jimmy, I don't blame you.   Unfortunately, I don't have a clean repro of the
problem because when I tried building a 4.20-rc2 kernel with
CONFIG_SCSI_MQ_DEFAULT=y, and tried running gce-xfstests, no problems were
detected.  And I'm too chicken to try running a kernel version which does have
the problem reported with CONFIG_SCSI_MQ_DEFAULT=y on my primary development
laptop.   :-)

I will say that if you are seeing problems on a particular file system (e.g.
/), by the time the kernel is reporting inconsistencies, the damage is already
done. Yes, you might want to try doing a backup before you reboot, in case the
system doesn't come back, but realistically speaking, the longer you keep
running, the more likely the problems are to compound.

So from a personally very selfish perspective, I'm hoping someone who has
already suffered corruption problems is willing to try either 4.19.3, or
disabling CONFIG_SCSI_MQ_DEFAULT, or both, and report that they are no longer
seeing problems, rather than my putting my own personal data at risk....

Maybe over T-day weekend, I'll try doing a full backup, and then try using
4.19.3 on my personal laptop --- but a "it works fine for me" report won't
necessarily mean anything, since to date I'm not able to reproduce the problem
on one of my systems.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (22 preceding siblings ...)
  2018-11-21 22:59 ` bugzilla-daemon
@ 2018-11-22  1:47 ` bugzilla-daemon
  2018-11-22  2:02 ` bugzilla-daemon
                   ` (246 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22  1:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Jens Axboe (axboe@kernel.dk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |axboe@kernel.dk

--- Comment #24 from Jens Axboe (axboe@kernel.dk) ---
It'd be critical to know if 4.19.3 is still showing the issue with MQ being on.
I'm going to try my luck at reproducing this issue as well, but given that
there hasn't been a lot of noise about it, not sure I'll have too much luck.

I've got a few suspects, so I'm also willing to spin a patch against 4.19.3 if
folks are willing to give that a go.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (23 preceding siblings ...)
  2018-11-22  1:47 ` bugzilla-daemon
@ 2018-11-22  2:02 ` bugzilla-daemon
  2018-11-22  2:08 ` bugzilla-daemon
                   ` (245 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22  2:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #25 from Jens Axboe (axboe@kernel.dk) ---
Ted, it seems to be affecting nvme as well, so there's really no escaping for
you. But there has to be some other deciding factor here, or all the block
testing would surely have caught this. Question is just what it is.

What would be most helpful is if someone who can reproduce this at will
could run a bisect between 4.18 and 4.19 to figure out wtf is going on here.
This commit:

commit 410306a0f2baa5d68970cdcf6763d79c16df5f23
Author: Ming Lei <ming.lei@redhat.com>
Date:   Wed Nov 14 16:25:51 2018 +0800

    SCSI: fix queue cleanup race before queue initialization is done

might explain the SCSI issues seen, but the very first comment is from someone
using nvme, and the above patch has no bearing on that at all. It is, however,
possible that some of the queue sync patches caused a blk-mq issue, and that is
why nvme is affected as well, and why the above commit seems to fix things on
the SCSI side.

I'm going to attach a patch here for 4.19 and it'd be great if folks could try
that.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (24 preceding siblings ...)
  2018-11-22  2:02 ` bugzilla-daemon
@ 2018-11-22  2:08 ` bugzilla-daemon
  2018-11-22  2:08 ` bugzilla-daemon
                   ` (244 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22  2:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #26 from Jens Axboe (axboe@kernel.dk) ---
Created attachment 279579
  --> https://bugzilla.kernel.org/attachment.cgi?id=279579&action=edit
4.19 patch


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (25 preceding siblings ...)
  2018-11-22  2:08 ` bugzilla-daemon
@ 2018-11-22  2:08 ` bugzilla-daemon
  2018-11-22  2:37 ` bugzilla-daemon
                   ` (243 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22  2:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #27 from Jens Axboe (axboe@kernel.dk) ---
Created attachment 279581
  --> https://bugzilla.kernel.org/attachment.cgi?id=279581&action=edit
4.20-rc3 patch


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (26 preceding siblings ...)
  2018-11-22  2:08 ` bugzilla-daemon
@ 2018-11-22  2:37 ` bugzilla-daemon
  2018-11-22  9:14 ` bugzilla-daemon
                   ` (242 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22  2:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #28 from Jens Axboe (axboe@kernel.dk) ---
If it's not this, another hint might be a discard change. Is everyone affected
using discard?


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (27 preceding siblings ...)
  2018-11-22  2:37 ` bugzilla-daemon
@ 2018-11-22  9:14 ` bugzilla-daemon
  2018-11-22 11:51 ` bugzilla-daemon
                   ` (241 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22  9:14 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #29 from carlphilippreh@gmail.com ---
Sorry for the late response, but I have been trying to reproduce the problem
with 4.19.2 for a while now. It seems that the problem I was experiencing
only happens with 4.19.1 and 4.19.0, and it did so very frequently. I can at
least confirm that I have CONFIG_SCSI_MQ_DEFAULT=y set in 4.19 but I didn't in
4.18. I hope that this is, at least for me, fixed for now.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (28 preceding siblings ...)
  2018-11-22  9:14 ` bugzilla-daemon
@ 2018-11-22 11:51 ` bugzilla-daemon
  2018-11-22 15:22 ` bugzilla-daemon
                   ` (240 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22 11:51 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Henrique Rodrigues (henrique.rodrigues@ist.utl.pt) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |henrique.rodrigues@ist.utl.
                   |                            |pt

--- Comment #30 from Henrique Rodrigues (henrique.rodrigues@ist.utl.pt) ---
(In reply to Theodore Tso from comment #23)
> So from a personally very selfish perspective, I'm hoping someone who has
> already suffered corruption problems is willing to try either 4.19.3, or
> disabling CONFIG_SCSI_MQ_DEFAULT, or both, and report that they are no
> longer seeing problems, rather than my putting my own personal data at risk....

On an Ubuntu 18.10 machine, I upgraded to 4.19.0 and started getting these
corruption errors. Yesterday I upgraded to 4.19.3 and was still getting
corruption. 4.18 was fine. Unfortunately the latest corruption rendered the
operating system unbootable. I'm going to try to fix it tonight and will then
try disabling CONFIG_SCSI_MQ_DEFAULT and test.

I'm in a somewhat fortunate position since the data that I care about lives on
another disk with a different filesystem type, so the corruption on the root
filesystem is just annoying and not really that dangerous.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (29 preceding siblings ...)
  2018-11-22 11:51 ` bugzilla-daemon
@ 2018-11-22 15:22 ` bugzilla-daemon
  2018-11-22 15:29 ` bugzilla-daemon
                   ` (239 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22 15:22 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Azat Khuzhin (a3at.mail@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |a3at.mail@gmail.com

--- Comment #31 from Azat Khuzhin (a3at.mail@gmail.com) ---
I'm also running 4.20-rc2 but have not experienced any corruption so far
(*crossing fingers*)

(In reply to Theodore Tso from comment #11)
> hmm, my laptop running 4.19.0 plus the ext4 commits landing in 4.20-rc2+ is
> *also* using CONFIG_SCSI_MQ_DEFAULT=n

But I do have CONFIG_SCSI_MQ_DEFAULT=y:

$ zgrep CONFIG_SCSI_MQ_DEFAULT=y /proc/config.gz
CONFIG_SCSI_MQ_DEFAULT=y
$ head /sys/block/dm-0/dm/use_blk_mq
1

(In reply to Jens Axboe from comment #25)
> Ted, it seems to be affecting nvme as well

And I do have an NVMe SSD:

nvme0n1       259:0    0 953.9G  0 disk
├─nvme0n1p1   259:1    0   260M  0 part  /boot
└─nvme0n1p2   259:2    0 953.6G  0 part
  └─cryptroot 254:0    0 953.6G  0 crypt /

And, as you can see, I have dm-crypt.

So it looks like this is not that simple (IOW, not every setup/env/hw is
affected).
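
The config check above can be generalised to saved .config files, which is
handy when comparing several kernel builds. This is a minimal sketch only: it
assumes a plain-text kernel config file is passed as the first argument, and
the helper name mq_default_state is hypothetical, not part of any kernel
tooling.

```shell
# Report how CONFIG_SCSI_MQ_DEFAULT is set in a plain-text kernel config.
# mq_default_state is a hypothetical helper name used only for illustration.
# $1: path to an uncompressed kernel config file.
mq_default_state() {
    config_file=$1
    if grep -q '^CONFIG_SCSI_MQ_DEFAULT=y$' "$config_file"; then
        echo "y"
    elif grep -q '^# CONFIG_SCSI_MQ_DEFAULT is not set$' "$config_file"; then
        echo "not set"
    else
        # The option does not appear in this config at all.
        echo "absent"
    fi
}
```

For the running kernel, /proc/config.gz can be decompressed to a temporary file
first (for example with zcat) and that file passed to the function. Note that
this only reports the build-time default; it says nothing about whether a given
device is actually using blk-mq, which is what the use_blk_mq sysfs file shows.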


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (30 preceding siblings ...)
  2018-11-22 15:22 ` bugzilla-daemon
@ 2018-11-22 15:29 ` bugzilla-daemon
  2018-11-22 17:04 ` bugzilla-daemon
                   ` (238 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22 15:29 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #32 from Azat Khuzhin (a3at.mail@gmail.com) ---
(In reply to Jens Axboe from comment #28)
> If it's not this, another hint might be a discard change. Is everyone
> affected using discard?

And what a coincidence, before upgrading to 4.20-rc2 I enabled discard:

# findmnt /
TARGET SOURCE                FSTYPE OPTIONS
/      /dev/mapper/cryptroot ext4   rw,relatime,discard

# cat /proc/cmdline
cryptdevice=...:cryptroot:allow-discards

# cryptsetup status cryptroot
/dev/mapper/cryptroot is active and is in use.
  ...
  flags:   discards

Plus I triggered fstrim manually at start:

# systemctl status fstrim
Nov 19 00:16:14 azat fstrim[23944]: /boot: 122.8 MiB (128716800 bytes) trimmed on /dev/nvme0n1p1
Nov 19 00:16:14 azat fstrim[23944]: /: 0 B (0 bytes) trimmed on /dev/mapper/cryptroot

But what is interesting here is that it did not discard anything for "/", hm
(did ext4 already do it for me at start?)
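
Whether a filesystem is mounted with online discard can also be read from the
mount table, which makes it easier for everyone affected to answer the question
the same way. A minimal sketch, assuming a file in /proc/mounts format (device,
mount point, type, comma-separated options); the helper name
mounted_with_discard is hypothetical.

```shell
# Succeed if the given mount point lists "discard" among its mount options.
# mounted_with_discard is a hypothetical helper name used for illustration.
# $1: mounts file in /proc/mounts format, $2: mount point to look up.
mounted_with_discard() {
    awk -v mp="$2" '
        $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "discard") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

Typical use would be "mounted_with_discard /proc/mounts / && echo discard". This
only detects the mount option; it does not cover periodic fstrim runs or the
dm-crypt allow-discards setting, which have to be checked separately as shown
above.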


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (31 preceding siblings ...)
  2018-11-22 15:29 ` bugzilla-daemon
@ 2018-11-22 17:04 ` bugzilla-daemon
  2018-11-22 19:38 ` bugzilla-daemon
                   ` (237 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22 17:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #33 from Jimmy.Jazz@gmx.net ---
(In reply to Jens Axboe from comment #28)
> If it's not this, another hint might be a discard change. Is everyone
> affected using discard?

All 'cat /sys/block/dm-*/dm/use_blk_mq' values are zero. Could MQ still be a
suspect?

I reproduced the issue with 4.19.3 as well, but without your patch.
The difference is that it happens less often, but still under heavy load (hours
of work, mostly compilations and monitoring).

The strange thing is that the affected disks do not have to be under load, and
on the next reboot fsck -f shows some of them as clean even though ext4_iget
corruptions were reported on them (tested during the reboot from the 4.19.2 to
the 4.19.3 kernel)!

It looks to me like some shared fs cache failure with unpleasant consequences.

Disabling user_xattr seems to be more helpful with 4.20.0-rc3 anyway: no errors
since, although not under heavy load.

Failures also appear on a plain old HDD device. For me, SSD discard is more a
consequence than a cause, but it's worse investigating it.

I will try your patch ASAP.

Thx


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (32 preceding siblings ...)
  2018-11-22 17:04 ` bugzilla-daemon
@ 2018-11-22 19:38 ` bugzilla-daemon
  2018-11-22 19:57 ` bugzilla-daemon
                   ` (236 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22 19:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Bart Van Assche (bvanassche@acm.org) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bvanassche@acm.org

--- Comment #34 from Bart Van Assche (bvanassche@acm.org) ---
I hit filesystem corruption with a desktop system running openSUSE Tumbleweed,
kernel v4.19.3 and ext4 on top of a SATA SSD with scsi_mod.use_blk_mq=Y in
/proc/cmdline. Discard was not enabled in /etc/fstab. After having enabled
fsck.mode=force the following appeared in the system log after a reboot:

/dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower.  IGNORED.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (33 preceding siblings ...)
  2018-11-22 19:38 ` bugzilla-daemon
@ 2018-11-22 19:57 ` bugzilla-daemon
  2018-11-22 20:03 ` bugzilla-daemon
                   ` (235 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22 19:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Reindl Harald (harry@rhsoft.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |harry@rhsoft.net

--- Comment #35 from Reindl Harald (harry@rhsoft.net) ---
> /dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower. 
> IGNORED

That is completely unrelated. I have seen it for years on several machines; it
is not cleaned up automatically, and booting into rescue mode to fix it is not
worth my time given the low importance of "could".


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (34 preceding siblings ...)
  2018-11-22 19:57 ` bugzilla-daemon
@ 2018-11-22 20:03 ` bugzilla-daemon
  2018-11-23  0:02 ` bugzilla-daemon
                   ` (234 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-22 20:03 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #36 from Bart Van Assche (bvanassche@acm.org) ---
(In reply to Reindl Harald from comment #35)
> > /dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower. 
> > IGNORED
> 
> That is completely unrelated. I have seen it for years on several machines;
> it is not cleaned up automatically, and booting into rescue mode to fix it is
> not worth my time given the low importance of "could".

That's good to know. The reason I commented on this bug report and said that I
hit data corruption is that my workstation failed to boot because fsck was
not able to repair the file system automatically. I had to run
fsck manually, answer a long list of scary questions and reboot.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (35 preceding siblings ...)
  2018-11-22 20:03 ` bugzilla-daemon
@ 2018-11-23  0:02 ` bugzilla-daemon
  2018-11-24 12:08 ` bugzilla-daemon
                   ` (233 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-23  0:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #37 from Jimmy.Jazz@gmx.net ---
@Jens Axboe

Please read "worth", not "worse", in comment 33.

I tried your patch for 4.19.3 and still get quite harmful ext4 errors like this
one:

EXT4-fs error (device dm-4): ext4_xattr_ibody_get:592: inode #4881425: comm rsync: corrupted in-inode xattr

The filesystems were clean at boot time and the system was idle.

tune2fs ends with
FS Error count:           64
First error time:         Fri Nov 23 00:19:25 2018
First error function:     ext4_xattr_ibody_get
First error line #:       592
First error inode #:      4881425
First error block #:      0
Last error time:          Fri Nov 23 00:19:25 2018
Last error function:      ext4_xattr_ibody_get
Last error line #:        592
Last error inode #:       4881430
Last error block #:       0
MMP block number:         9255
MMP update interval:      5

If you are interested in its dumpe2fs result let me know.

I don't use binary modules.
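
When many such messages pile up, it can help to summarise them per device and
per reporting function before attaching logs. A minimal sketch, assuming the
kernel log is fed on stdin with each message on a single line in the
"EXT4-fs error (device X): function:line: ..." form shown above; the helper
name ext4_error_summary is hypothetical, and lines in any other format are
simply skipped.

```shell
# Count EXT4-fs error lines per device and reporting function.
# ext4_error_summary is a hypothetical helper name used for illustration.
# Reads kernel log text on stdin; prints "<count> <device> <function>".
ext4_error_summary() {
    # Keep only matching lines, reduced to "<device> <function>".
    sed -n 's/.*EXT4-fs error (device \([^)]*\)): \([A-Za-z0-9_]*\):[0-9]*:.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}
```

Typical use would be "dmesg | ext4_error_summary". The counts cover only what is
still in the log buffer, so they need not match the "FS Error count" that
tune2fs reports from the superblock.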


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (36 preceding siblings ...)
  2018-11-23  0:02 ` bugzilla-daemon
@ 2018-11-24 12:08 ` bugzilla-daemon
  2018-11-24 13:07 ` bugzilla-daemon
                   ` (232 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-24 12:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #38 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Jimmy.Jazz from comment #37)
> @Jens Axboe
> 
> Please read "worth", not "worse", in comment 33.
> 
> I tried your patch for 4.19.3 and still get quite harmful ext4 errors like
> this one:
> 
> EXT4-fs error (device dm-4): ext4_xattr_ibody_get:592: inode #4881425: comm rsync: corrupted in-inode xattr
> 
> The filesystems were clean at boot time and the system was idle.
> 
> tune2fs ends with
> FS Error count:           64
> First error time:         Fri Nov 23 00:19:25 2018
> First error function:     ext4_xattr_ibody_get
> First error line #:       592
> First error inode #:      4881425
> First error block #:      0
> Last error time:          Fri Nov 23 00:19:25 2018
> Last error function:      ext4_xattr_ibody_get
> Last error line #:        592
> Last error inode #:       4881430
> Last error block #:       0
> MMP block number:         9255
> MMP update interval:      5
> 
> If you are interested in its dumpe2fs result let me know.
> 
> I don't use binary modules.

Jimmy,

what *I* would do if I were in your shoes is 

- run a kernel < 4.19, make sure the fs is OK and *backup important data* 
- compile 4.19.3 with CONFIG_SCSI_MQ_DEFAULT *not* set

and see what happens. If you still get corruption, CONFIG_SCSI_MQ_DEFAULT
probably is not the culprit. If you don't, it at least has something to do with
it.

It seems that CONFIG_SCSI_MQ_DEFAULT *not* set was the default for kernels
<= 4.18.19.

Others here obviously don't have problems with CONFIG_SCSI_MQ_DEFAULT=y and
kernels >= 4.19, but you never know.


^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (37 preceding siblings ...)
  2018-11-24 12:08 ` bugzilla-daemon
@ 2018-11-24 13:07 ` bugzilla-daemon
  2018-11-24 14:10 ` bugzilla-daemon
                   ` (231 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-24 13:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

AdamB (abennett72@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |abennett72@gmail.com

--- Comment #39 from AdamB (abennett72@gmail.com) ---
I also experienced an ext4 file system corruption with 4.19.1, after resuming
from suspend-to-ram.

I've run 4.12, 4.13, 4.14, 4.16, 4.17, and 4.18 on the same machine with a
nearly identical .config and never had file system corruption.

For all those kernels, I've had CONFIG_SCSI_MQ_DEFAULT=y.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (38 preceding siblings ...)
  2018-11-24 13:07 ` bugzilla-daemon
@ 2018-11-24 14:10 ` bugzilla-daemon
  2018-11-25  7:59 ` bugzilla-daemon
                   ` (230 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-24 14:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #40 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to AdamB from comment #39)
> I also experienced an ext4 file system corruption with 4.19.1, after
> resuming from suspend-to-ram.
> 
> I've ran 4.12, 13, 14, 16, 17, and 18 on the same machine with near
> identical .config and never had a file system corruption.
> 
> For all those kernels, I've had CONFIG_SCSI_MQ_DEFAULT=y.

I can say the same for CONFIG_SCSI_MQ_DEFAULT=n.
But this is not exactly the same as running 4.19.x with
CONFIG_SCSI_MQ_DEFAULT=n.

As someone already pointed out: the best way to find out what's behind this is
bisecting between 4.18 and 4.19 by someone affected by the problem. This is
time consuming but in the end may also be the fastest way. A backup is IMO
mandatory in this case.
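
For anyone who has not bisected before, the mechanics can be tried on a
throwaway repository first. The following is only a sketch under made-up
assumptions: the scratch repository, the "state" file and the "corrupt"
marker stand in for a kernel tree and a real corruption test.

```shell
#!/bin/sh
# Toy demonstration of git bisect on a scratch repository.
# Commits 1-4 are "good"; commit 5 introduces the "corrupt" marker.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"
git config user.name "Bisect Demo"

i=1
while [ "$i" -le 8 ]; do
    if [ "$i" -ge 5 ]; then
        echo corrupt > state
    else
        echo clean > state
    fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "commit $i"
    i=$((i + 1))
done

# Mark the newest commit as bad and the root commit as good.
good=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$good" >/dev/null

# The test command exits 0 for a good revision and 1 for a bad one.
git bisect run sh -c '! grep -q corrupt state' >/dev/null

# refs/bisect/bad now points at the first bad commit.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad"
git bisect reset >/dev/null 2>&1
```

On a kernel tree the same steps would start with `git bisect start`,
`git bisect bad v4.19` and `git bisect good v4.18`, with every step built,
booted and judged by hand (`git bisect good` or `git bisect bad`) instead of
by `git bisect run`.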


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (39 preceding siblings ...)
  2018-11-24 14:10 ` bugzilla-daemon
@ 2018-11-25  7:59 ` bugzilla-daemon
  2018-11-25  8:02 ` bugzilla-daemon
                   ` (229 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-25  7:59 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

HB (hb@testwelt.deneb.uberspace.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hb@testwelt.deneb.uberspace
                   |                            |.de

--- Comment #41 from HB (hb@testwelt.deneb.uberspace.de) ---
I have this bug with ubuntu 18.04 kernel 4.15.0-39, too.

My Desktop: 
SSD (Samsung 840) with three partitions:
/boot : ext2
/     : ext4
swap

HDD1: one ext4 partition
HDD2: luks encrypted, never mounted at boot time and not used when the error
happens.

No Raid-Stuff used.

The problem only occurs on the ext4 partition of the SSD.

Sometimes during boot there are messages like "could not access ata
devices" and timeouts on ATA commands. It retries several times until it gives
up, and I don't reach the busybox command line.

Sometimes I reach the busybox command line but can't fix it there, because
there is no fsck in busybox.

I have to connect the ssh to a notebook via a USB-to-SATA adapter; the two
partitions are recognised without problems and in most cases automounted. If I
force an fsck, some orphaned inodes are discovered and the fs is fixed.

After this I can boot from this SSD in the desktop without problems until it
happens again.

The weird thing is that sometimes the SSD is still not recognised after this
and shows the ATA timeouts described above.

Even after cutting power to the desktop completely (disconnecting it from the
power socket, waiting some minutes, then doing a cold boot) it gets stuck the
same way.

The notebook where I fix the SSD has the same OS installed and this type of
problem has never occurred there. Maybe I have just been lucky so far; I don't
use the notebook very much.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (40 preceding siblings ...)
  2018-11-25  7:59 ` bugzilla-daemon
@ 2018-11-25  8:02 ` bugzilla-daemon
  2018-11-25 21:47 ` bugzilla-daemon
                   ` (228 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-25  8:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #42 from HB (hb@testwelt.deneb.uberspace.de) ---
Addendum: (In reply to HB from comment #41)
> I have to connect the ssh 
I meant the "SSD", not ssh.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (41 preceding siblings ...)
  2018-11-25  8:02 ` bugzilla-daemon
@ 2018-11-25 21:47 ` bugzilla-daemon
  2018-11-25 22:06 ` bugzilla-daemon
                   ` (227 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-25 21:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #43 from Jimmy.Jazz@gmx.net ---
I am doing some bisecting on the linux-master git source.
I'm currently at kernel version 4.19.0-rc2-radeon-00922-gf48097d294-dirty.
I hope that all the fscks didn't make my system immune to this issue :)


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (42 preceding siblings ...)
  2018-11-25 21:47 ` bugzilla-daemon
@ 2018-11-25 22:06 ` bugzilla-daemon
  2018-11-25 22:24 ` bugzilla-daemon
                   ` (226 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-25 22:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #44 from Jens Axboe (axboe@kernel.dk) ---
Thanks a lot, Jimmy! That's what we need to make some progress here, in lieu of
me and/or Ted being able to reproduce this issue.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (43 preceding siblings ...)
  2018-11-25 22:06 ` bugzilla-daemon
@ 2018-11-25 22:24 ` bugzilla-daemon
  2018-11-26  0:00 ` bugzilla-daemon
                   ` (225 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-25 22:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #45 from Henrique Rodrigues (henrique.rodrigues@ist.utl.pt) ---
(In reply to Rainer Fiebig from comment #38)
> what *I* would do if I were in your shoes is 
> 
> - run a kernel < 4.19, make sure the fs is OK and *backup important data* 
> - compile 4.19.3 with CONFIG_SCSI_MQ_DEFAULT *not* set
> 
> and see what happens.


This is what I did: recompile my 4.19.3 kernel with CONFIG_SCSI_MQ_DEFAULT=n.
I've used the computer normally, ran some heavy read/write operations, rebooted
a bunch of times and had no problems since then. I'm on Ubuntu 18.10 with an
SSD, encrypted LUKS root partition.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (44 preceding siblings ...)
  2018-11-25 22:24 ` bugzilla-daemon
@ 2018-11-26  0:00 ` bugzilla-daemon
  2018-11-26  0:04 ` bugzilla-daemon
                   ` (224 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26  0:00 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #46 from Theodore Tso (tytso@mit.edu) ---
So Henrique, the only difference between the 4.19.3 kernel that worked and the
one where you saw corruption was CONFIG_SCSI_MQ_DEFAULT?   Can you diff
the two configs to be sure?
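
One way to do such a comparison, as a minimal sketch: sort the option lines of
both files and print only the differences. The function name config_diff is an
assumption made for this example; it is not part of any kernel tooling.

```shell
#!/bin/sh
# Print only the option lines that differ between two kernel config files.
# Lines from the first file are prefixed "<", lines from the second ">".
# config_diff is a hypothetical helper name used for illustration.
config_diff() {
    a=$(mktemp)
    b=$(mktemp)
    # Keep both "CONFIG_X=..." and "# CONFIG_X is not set" lines.
    grep 'CONFIG_' "$1" | sort > "$a"
    grep 'CONFIG_' "$2" | sort > "$b"
    diff "$a" "$b" | grep '^[<>]' || true
    rm -f "$a" "$b"
}
```

Usage would be along the lines of `config_diff config-good config-bad`, with
both file names being placeholders. The kernel source tree also ships
`scripts/diffconfig`, which serves the same purpose.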

What can you tell us about the SSD?  Is it a SATA-attached SSD, or
NVMe-attached?

What I can report is my personal development laptop is running 4.19.0 (plus the
ext4 patches that landed in 4.20-rc1) with CONFIG_SCSI_MQ_DEFAULT=n.  (Although
as others have pointed out, that shouldn't matter since my SSD is
NVMe-attached, and so it doesn't go through the SCSI stack.)   My laptop runs
Debian unstable, and uses an encrypted LUKS partition on top of which I use
LVM.   I do use regular suspend-to-ram (not suspend-to-idle, since that burns
way too much power; there's a kernel BZ open on that issue) since it is a
laptop.

I have also run xfstest runs using 4.19.0, 4.19.1, 4.19.2, and 4.20-rc2 with
CONFIG_SCSI_MQ_DEFAULT=n; it's using the gce-xfstests[1] test appliance which
means I'm using virtio-SCSI on top of LVM, and it runs a large number of
regression tests, many with heavy read/write loads, but none of the file
systems is mounted for more than 5-6 minutes before we unmount and then run
fsck on it.  We do *not* do any suspend/resumes, although we do test the file
system side of suspend/resume using the freeze and thaw ioctls.  There were no
unusual problems noticed.  

[1] https://thunk.org/gce-xfstests

I have also run gce-xfstests on 4.20-rc2 with CONFIG_SCSI_MQ_DEFAULT=y, with
the same configuration as above --- virtio-scsi with LVM on top.   Nothing
unusual was detected there.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (45 preceding siblings ...)
  2018-11-26  0:00 ` bugzilla-daemon
@ 2018-11-26  0:04 ` bugzilla-daemon
  2018-11-26  8:49 ` bugzilla-daemon
                   ` (223 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26  0:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #47 from Theodore Tso (tytso@mit.edu) ---
Bart, in #34, was this the only thing which e2fsck reported?

/dev/sda2: Inode 12190197 extent tree (at level 2) could be narrower.  IGNORED.

That's not a file system problem; it's a potential optimization which e2fsck
detected, which would eliminate a random 4k read when running a random read
workload against that inode.  If you don't want to see this, you can use
e2fsck's  "-E fixes_only" option.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (46 preceding siblings ...)
  2018-11-26  0:04 ` bugzilla-daemon
@ 2018-11-26  8:49 ` bugzilla-daemon
  2018-11-26 12:23 ` bugzilla-daemon
                   ` (222 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26  8:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #48 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Jimmy.Jazz from comment #43)
> I do some bisecting on the linux-master git source.
> I'm at the kernel version 4.19.0-rc2-radeon-00922-gf48097d294-dirty
> currently. I hope that all the fscks didn't make my system immune to this
> issue :)

Great! Everyone who's had his share of bisecting knows to value your effort! ;)


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (47 preceding siblings ...)
  2018-11-26  8:49 ` bugzilla-daemon
@ 2018-11-26 12:23 ` bugzilla-daemon
  2018-11-26 12:24 ` bugzilla-daemon
                   ` (221 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26 12:23 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #49 from Jimmy.Jazz@gmx.net ---
In short,

Release 4.19.0-rc2-radeon-00922-gf48097d294-dirty

A: Nothing happened at first while the computer was nearly idle.

B: I mounted a USB SD medium, first ro (default), then rw.
I transferred some big files to it (cp and tar) from two different xterms.
Lots of errors; the kernel switched the stick to read-only.
Transfer failed.
I unmounted and remounted the filesystem without running fsck and restarted
the transfer.
Transfer ok.
umount ok.

Next I will mark it as git bisect bad and then reboot.

Please see the attachments.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (48 preceding siblings ...)
  2018-11-26 12:23 ` bugzilla-daemon
@ 2018-11-26 12:24 ` bugzilla-daemon
  2018-11-26 12:25 ` bugzilla-daemon
                   ` (220 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26 12:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #50 from Jimmy.Jazz@gmx.net ---
Created attachment 279655
  --> https://bugzilla.kernel.org/attachment.cgi?id=279655&action=edit
dmesg EXT4 errors


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (49 preceding siblings ...)
  2018-11-26 12:24 ` bugzilla-daemon
@ 2018-11-26 12:25 ` bugzilla-daemon
  2018-11-26 15:49 ` bugzilla-daemon
                   ` (219 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26 12:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #51 from Jimmy.Jazz@gmx.net ---
Created attachment 279657
  --> https://bugzilla.kernel.org/attachment.cgi?id=279657&action=edit
dumpe2fs


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (50 preceding siblings ...)
  2018-11-26 12:25 ` bugzilla-daemon
@ 2018-11-26 15:49 ` bugzilla-daemon
  2018-11-26 16:34 ` bugzilla-daemon
                   ` (218 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26 15:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #52 from Jimmy.Jazz@gmx.net ---
FYI, I need 2 patches for my initramfs to be generated; IMO they should not
interfere.

drivers/tty/vt/defkeymap.map to get the fr kbd mapping
usr/Makefile due to shell evaluation

--- usr/Makefile.orig   2017-02-19 23:34:00.000000000 +0100
+++ usr/Makefile        2017-02-22 23:44:24.554921038 +0100
@@ -43,7 +43,7 @@
 targets := $(datafile_y)

 # do not try to update files included in initramfs
-$(deps_initramfs): ;
+$(deps_initramfs): ; 

 $(deps_initramfs): klibcdirs
 # We rebuild initramfs_data.cpio if:
@@ -52,5 +52,6 @@
 # 3) If gen_init_cpio are newer than initramfs_data.cpio
 # 4) arguments to gen_initramfs.sh changes
 $(obj)/$(datafile_y): $(obj)/gen_init_cpio $(deps_initramfs) klibcdirs
-       $(Q)$(initramfs) -l $(ramfs-input) > $(obj)/$(datafile_d_y)
+       $(Q)$(initramfs) -l $(ramfs-input) | \
+       sed '2,$$s/:/\\:/g' > $(obj)/$(datafile_d_y)
        $(call if_changed,initfs)


[quote]
I don't want to break T. Tso's rules, but I remember encountering a similar
issue when I initially tried a partitionable array with major 9. At that time
I switched to major 254 as explained in comment 8 and the problem didn't come
up since... until the recent kernel 4.19 with mdadm 4.1 and kernel devtmpfs,
which switched the metadevices to major 9. Also, why? A big mystery.
[/quote]

Now to the facts:
I was able to reboot in rescue mode. I use the word "service" to illustrate
the process; it has nothing to do with Debian.

# service mdraid start
# service vg0 start
# cd /dev/mapper
# for i in *-*; do fsck /dev/mapper/$i; done
All clean except sys-scm (f  word)
the usb stick is clean too.
I need a terminal for interactive repairs so I write the beginning by hand.

Inode 58577 has extra size (103) which is invalid
Fix<y>? yes
Timestamp(s) on inode 58577 beyond 2310-04-04 are likely pre-1970
+ 9 others
Inodes that were part of corrupted orphan linked list found. Fix<y>?yes
+ 3 others
i_size is 139685221367808, should be 0.
i_blocks is 32523, should be 0.
+ 22 others
Pass 2: checking directory structure
Inode 58577 (/git/toolkit.git/objects/e0) has invalid mode (0150)
+ 9 others
Unattached inode 17013
Connect to /lost+found
Inode 17013 ref count is 2, should be 1
+ 35 others
Inode 58586 (...) has invalid mode (0122)
+ 5 others
[...]
Unattached inode 262220
Connect to /lost+found<y>? yes
Inode 262220 ref count is 2, should be 1.  Fix<y>? yes
Pass 5: Checking group summary information
Block bitmap differences:  -(9252--9255) -10490 -(10577--10578) -(16585--16589)
-295391 -(682164--682165)
Fix<y>? yes
Free blocks count wrong for group #0 (2756, counted=2768).
Fix<y>?
Block bitmap differences:  -(9252--9255) -10490 -(10577--10578) -(16585--16589)
-295391 -(682164--682165)
Fix<y>? yes
Free blocks count wrong for group #0 (2756, counted=2768).
Fix<y>? yes
Free blocks count wrong for group #9 (2784, counted=2785).
Fix<y>? yes
Free blocks count wrong for group #20 (14702, counted=14704).
Fix<y>? yes
Free blocks count wrong (718736, counted=718751).
Fix<y>? yes
Inode bitmap differences:  -58591
Fix<y>? yes
Free inodes count wrong for group #7 (6283, counted=6284).
Fix<y>? yes
Directories count wrong for group #7 (1133, counted=1121).
Fix<y>? yes
Free inodes count wrong (322025, counted=322026).
Fix<y>? yes

scm: ***** FILE SYSTEM WAS MODIFIED *****
scm: 71190/393216 files (0.1% non-contiguous), 854113/1572864 blocks
fsck from util-linux 2.32.1
e2fsck 1.44.4 (18-Aug-2018)

service vg0 stop
service mdraid stop

ctrl-alt-del

I was able to reboot with init 1 from grub then init 4 from tty1 as root with
kernel 4.19.4. Filesystems were clean.

exit from tty1

log in again as normal user under X
su - root
next bisect in action

I must admit, it is time consuming.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (51 preceding siblings ...)
  2018-11-26 15:49 ` bugzilla-daemon
@ 2018-11-26 16:34 ` bugzilla-daemon
  2018-11-27  1:32 ` bugzilla-daemon
                   ` (217 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-26 16:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #53 from Rainer Fiebig (jrf@mailbox.org) ---
[...]
> 
> I must admit, it is time consuming.

You have been warned. ;)

But in the end you will be rewarded with something like this:

> git bisect good
1234xx56789yy is the first bad commit
...

And honors and glory will rain down on you!
OK, this may be a bit exaggerated. ;)


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (52 preceding siblings ...)
  2018-11-26 16:34 ` bugzilla-daemon
@ 2018-11-27  1:32 ` bugzilla-daemon
  2018-11-27 12:24 ` bugzilla-daemon
                   ` (216 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-27  1:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #54 from Henrique Rodrigues (henrique.rodrigues@ist.utl.pt) ---
(In reply to Theodore Tso from comment #46)
> So Henrique, the only difference between the 4.19.3 kernel that worked and
> the one where you saw corruption was CONFIG_SCSI_MQ_DEFAULT?   Can you diff
> the two configs to be sure?

The bad news is that I seem to have made a mistake and there are more
changes than that one.

The other bad news is that I got another corruption even with
CONFIG_SCSI_MQ_DEFAULT=n.


> What can you tell us about the SSD?  Is it a SATA-attached SSD, or
> NVMe-attached?

It's a SATA-attached SSD.

I'll attach more information (dmesg, lspci, kernel config, etc). Unfortunately
fsck now tells me I've got a bad magic number in the super-block, so I think
I'd better start copying some stuff over to another disk before attempting
anything else.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (53 preceding siblings ...)
  2018-11-27  1:32 ` bugzilla-daemon
@ 2018-11-27 12:24 ` bugzilla-daemon
  2018-11-27 17:02 ` bugzilla-daemon
                   ` (215 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-27 12:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #55 from Jimmy.Jazz@gmx.net ---
I didn't get anywhere with 4.18.0-radeon-07013-g54dbe75bbf-dirty because the
radeon module gives me a black screen.

With 4.18.0-radeon-03131-g0a957467c5-dirty, the ext4 filesystems were stable,
but 2 hours later there was an exception followed by a sudden reboot without
warning. On the next try, it rebooted immediately.
So this one is bad too.

During bzImage compilation, ld returned:
ld.bfd: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only
section `.head.text'
ld.bfd: warning: creating a DT_TEXTREL in object

Does that look suspicious to you?

FYI, I was stuck many times with the following message until I realized the
usr/.initramfs_data.cpio.xz.d file had not been removed from the directory
(sigh).

# make                                                                 
  CALL    scripts/checksyscalls.sh
  DESCEND  objtool
  CHK     include/generated/compile.h
usr/Makefile:48: *** multiple target patterns.  Stop.
make: *** [Makefile:1041: usr] Error 2


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (54 preceding siblings ...)
  2018-11-27 12:24 ` bugzilla-daemon
@ 2018-11-27 17:02 ` bugzilla-daemon
  2018-11-27 21:54 ` bugzilla-daemon
                   ` (214 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-27 17:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #56 from Rainer Fiebig (jrf@mailbox.org) ---
FWIW: I've installed a defconfig-4.19.3 in a VirtualBox-VM. But our bug hasn't
shown up so far.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (55 preceding siblings ...)
  2018-11-27 17:02 ` bugzilla-daemon
@ 2018-11-27 21:54 ` bugzilla-daemon
  2018-11-28  0:06 ` bugzilla-daemon
                   ` (213 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-27 21:54 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Guenter Roeck (linux@roeck-us.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linux@roeck-us.net

--- Comment #57 from Guenter Roeck (linux@roeck-us.net) ---
I have seen the problem on two of four systems running v4.19.4. All systems are 

System 1:
  MSI B450 TOMAHAWK (MS-7C02)
  Ryzen 2700X
  Drive 1: NVME (500GB)
  Drive 2: SATA HDD (WD4001FAEX-00MJRA0, 4TB)
  Problem seen on SATA HDD, with both 4.19.3 and 4.19.4

System 2:
  MSI B350M MORTAR (MS-7A37)
  Ryzen 1700X
  Drive 1: SSD Samsung SSD 840 PRO 250GB
  Drive 2: SSD Samsung SSD 840 EVO 250GB
  Problem seen on both drives, with both 4.19.3 and 4.19.4

System 3:
  Gigabyte AB350M-Gaming 3
  Ryzen 1700X
  Drive 1: SSD Samsung SSD 840 PRO 250GB
  Drive 2: SSD M4-CT256M4SSD2 (250GB)
  Problem not seen (yet)

System 4:
  MSI B350M MORTAR (MS-7A37)
  Ryzen 1700X
  Drive 1: NVME (500GB)
  Problem not seen (yet)

Default configuration was CONFIG_SCSI_MQ_DEFAULT=y. I tried with
CONFIG_SCSI_MQ_DEFAULT=n on system 2 (with 4.19.4) and hit the problem again
almost immediately.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (56 preceding siblings ...)
  2018-11-27 21:54 ` bugzilla-daemon
@ 2018-11-28  0:06 ` bugzilla-daemon
  2018-11-28  5:05 ` bugzilla-daemon
                   ` (212 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28  0:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #58 from Henrique Rodrigues (henrique.rodrigues@ist.utl.pt) ---
Created attachment 279685
  --> https://bugzilla.kernel.org/attachment.cgi?id=279685&action=edit
debug info


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (57 preceding siblings ...)
  2018-11-28  0:06 ` bugzilla-daemon
@ 2018-11-28  5:05 ` bugzilla-daemon
  2018-11-28  5:06 ` bugzilla-daemon
                   ` (211 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28  5:05 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #59 from Theodore Tso (tytso@mit.edu) ---
Created attachment 279687
  --> https://bugzilla.kernel.org/attachment.cgi?id=279687&action=edit
Ext4 from 4.18

I'm pretty sure the problem is not in the ext4 changes between 4.18 and 4.19,
since the changes are all quite innocuous (and if it was in the ext4 code, the
regression testing really should have picked it up).

But just to rule things out, I've uploaded the contents of fs/ext4 from 4.18.  
I've verified it can be transplanted on top of 4.19 kernel.   Could the people
who are experiencing problems with 4.19 try building a kernel with the 4.18
fs/ext4 directory?   If you still see problems, then the problem has to be
elsewhere.   If you don't, then we can take a closer look at the ext4 changes
(although I'd then be really puzzled why it's only showing up for some folks,
but not others).
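
For those who want to try this, the transplant amounts to replacing one
directory in the newer tree with the copy from the older one. A minimal
sketch, assuming both source trees are unpacked side by side; the function
name transplant_dir and the directory names in the usage note are
hypothetical:

```shell
#!/bin/sh
# Replace SUBDIR in the tree DST with the copy of SUBDIR from the tree SRC.
# The old copy is removed first so that files which exist only in DST
# do not survive the transplant.
# transplant_dir is a made-up helper name for this illustration.
transplant_dir() {
    src=$1
    dst=$2
    subdir=$3
    [ -d "$src/$subdir" ] || { echo "missing $src/$subdir" >&2; return 1; }
    rm -rf "${dst:?}/$subdir"
    mkdir -p "$(dirname "$dst/$subdir")"
    cp -R "$src/$subdir" "$dst/$subdir"
}
```

With trees unpacked as, say, linux-4.18 and linux-4.19.4, the call would be
`transplant_dir linux-4.18 linux-4.19.4 fs/ext4`, followed by a normal kernel
rebuild.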


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (58 preceding siblings ...)
  2018-11-28  5:05 ` bugzilla-daemon
@ 2018-11-28  5:06 ` bugzilla-daemon
  2018-11-28  5:10 ` bugzilla-daemon
                   ` (210 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28  5:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Theodore Tso (tytso@mit.edu) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #279687|Ext4 from 4.18              |Ext4 from 4.18 (tar.gz)
        description|                            |


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (59 preceding siblings ...)
  2018-11-28  5:06 ` bugzilla-daemon
@ 2018-11-28  5:10 ` bugzilla-daemon
  2018-11-28  8:30 ` bugzilla-daemon
                   ` (209 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28  5:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #60 from Theodore Tso (tytso@mit.edu) ---
Henrique -- what is dm-0?  How is it configured?   And are you using discard
(either the mount option, or fstrim)?   Thanks!!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (60 preceding siblings ...)
  2018-11-28  5:10 ` bugzilla-daemon
@ 2018-11-28  8:30 ` bugzilla-daemon
  2018-11-28 14:40 ` bugzilla-daemon
                   ` (208 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28  8:30 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #61 from Henrique Rodrigues (henrique.rodrigues@ist.utl.pt) ---
(In reply to Theodore Tso from comment #60)
> Henrique -- what is dm-0?  How is it configured?   And are you using discard
> (either the mount option, or fstrim)?   Thanks!!

dm-0 is a LUKS encrypted partition that I use as /. I have fstrim running
weekly with "fstrim -av" (Ubuntu's default).
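
For anyone else answering the same question, here is a minimal sketch for
listing ext4 mounts that carry the discard mount option. The function name
ext4_discard_mounts is made up for illustration; it only reads a file in
/proc/mounts format and does not detect periodic fstrim runs.

```shell
# Hypothetical helper: print "device mountpoint" for every ext4 file system
# mounted with the discard option. Reads /proc/mounts unless another file in
# the same format is given as the first argument.
ext4_discard_mounts() {
    awk '$3 == "ext4" && $4 ~ /(^|,)discard(,|$)/ { print $1, $2 }' \
        "${1:-/proc/mounts}"
}

# Usage:
#   ext4_discard_mounts
```

On systemd-based distributions, "systemctl list-timers fstrim.timer" should
show whether a periodic fstrim is scheduled.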


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (61 preceding siblings ...)
  2018-11-28  8:30 ` bugzilla-daemon
@ 2018-11-28 14:40 ` bugzilla-daemon
  2018-11-28 15:09 ` bugzilla-daemon
                   ` (207 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28 14:40 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #62 from Jimmy.Jazz@gmx.net ---
If you don't mind, I'll continue bisecting.

But building a kernel becomes harder with the genuine kernel. Three times in a
row I was unable to compile the kernel. It fails with,
arch/x86/entry/vdso/vclock_gettime-x32.o:vclock_gettime.c:function __vdso_gettimeofday
: error: relocation overflow: reference to 'vvar_vsyscall_gtod_data'

If it fails again, I will need to patch it:
--- arch/x86/entry/vdso/Makefile~ 2016-10-02 23:24:33.000000000 +0000
+++ arch/x86/entry/vdso/Makefile  2016-11-16 09:35:13.406216597 +0000
@@ -97,6 +97,7 @@

 CPPFLAGS_vdsox32.lds = $(CPPFLAGS_vdso.lds)
 VDSO_LDFLAGS_vdsox32.lds = -Wl,-m,elf32_x86_64 \
+        -fuse-ld=bfd \
         -Wl,-soname=linux-vdso.so.1 \
         -Wl,-z,max-page-size=4096 \
         -Wl,-z,common-page-size=4096

ld gold is my default.

And sorry, I should have patched l1tf_vmx_mitigation too. My mistake.

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 27830880e7a7..cb4a16292aa7 100644
--- arch/x86/kernel/cpu/bugs.c
+++ arch/x86/kernel/cpu/bugs.c
@@ -664,10 +664,9 @@ void x86_spec_ctrl_setup_ap(void)
 enum l1tf_mitigations l1tf_mitigation __ro_after_init = L1TF_MITIGATION_FLUSH;
 #if IS_ENABLED(CONFIG_KVM_INTEL)
 EXPORT_SYMBOL_GPL(l1tf_mitigation);
-
+#endif
 enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
 EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);
-#endif

 static void __init l1tf_select_mitigation(void)
 {


Nevertheless, the ext4 issue as it appears doesn't make sense. I can compile
packages during the test to keep the CPU busy, which makes the issue easier to
reproduce.
Each time I reboot, I run fsck on the ext4 partitions (in both rescue mode and
the normal init process), and it looks as if, for some partitions, e2fsck is
unable to handle the 'Structure needs cleaning' issue under circumstances I
haven't pinned down (remember my remark about fsck -D). If that's confirmed, a
corrupt fs could still be corrupt on the next reboot and mislead us.

In that case, Jens Axboe's 4.19.4 patch does its job. I'm bisecting the kernel
on a 4.19.4 patched kernel version. The only fs that stays corrupt after each
reboot is my backup partition (sigh).
Could someone investigate in that direction, please?

I'm using e2fsprogs 1.44.4 package.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (62 preceding siblings ...)
  2018-11-28 14:40 ` bugzilla-daemon
@ 2018-11-28 15:09 ` bugzilla-daemon
  2018-11-28 15:17 ` bugzilla-daemon
                   ` (206 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28 15:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #63 from Michel Dänzer (michel@daenzer.net) ---
(In reply to Jason Gambrel from comment #3)
> No Raid. 500gb SSD. 16gb ram with a 4gb swap file (not a swap partition).

FWIW, I was running into ext4 metadata corruption every few days with 4.19
using swap files (on the ext4 / on LVM on LUKS). On a hunch, switched to a swap
partition on LVM on LUKS two weeks ago, and haven't run into it since. Swap
files were working fine with pre-4.19 kernels. In case it matters, I run fstrim
in a weekly cronjob, with discard enabled in /etc/lvm/lvm.conf and
/etc/crypttab.
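
To compare setups, a small sketch that lists active swap *files* (as opposed
to swap partitions). The helper name swap_files is hypothetical; it just
filters a file in /proc/swaps format on the Type column.

```shell
# Hypothetical helper: print the path of every active swap area whose type
# is "file". Reads /proc/swaps unless another file in the same format is
# given as the first argument. The first line of /proc/swaps is a header.
swap_files() {
    awk 'NR > 1 && $2 == "file" { print $1 }' "${1:-/proc/swaps}"
}

# Usage:
#   swap_files
```

An empty result means that only swap partitions (or no swap at all) are in
use.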


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (63 preceding siblings ...)
  2018-11-28 15:09 ` bugzilla-daemon
@ 2018-11-28 15:17 ` bugzilla-daemon
  2018-11-28 16:26 ` bugzilla-daemon
                   ` (205 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28 15:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #64 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Jimmy.Jazz from comment #62)
> In that case, Jens Axboe 4.19.4 patch does its work. I'm bisecting the
> kernel on a 4.19.4 patched kernel version. The only fs that's stay corrupt
> after each reboot is my backup partition (sig).
> Could someone investigate in that direction please ?

How certain are you that my 4.19 patch fixes the issue completely for you? If
100%, can you also try with 4.19.4 + just the first hunk of that patch? In
other words, only apply the part to block/blk-core.c, not the one to
block/blk-mq.c

Thanks!
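
One possible way to apply only the block/blk-core.c part of a multi-file
patch is git apply with --include, sketched below. The helper name apply_only
and the patch file name in the usage comment are placeholders, not something
posted in this bug.

```shell
# Hypothetical helper: apply only those parts of a patch that touch one path.
#   $1 = patch file
#   $2 = path (or path pattern) to keep, e.g. block/blk-core.c
# With --include given, git apply ignores changes to every other path.
apply_only() {
    patch_file=$1
    keep_path=$2
    git apply --include="$keep_path" "$patch_file"
}

# Usage, from the top of the kernel source tree (file name is a placeholder):
#   apply_only ../blk-fix.patch 'block/blk-core.c'
```

git apply also works on an unpacked tarball that is not a git repository, so
this should not depend on how the 4.19.4 tree was obtained.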


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (64 preceding siblings ...)
  2018-11-28 15:17 ` bugzilla-daemon
@ 2018-11-28 16:26 ` bugzilla-daemon
  2018-11-28 17:28 ` bugzilla-daemon
                   ` (204 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28 16:26 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Néstor A. Marchesini (nestorm_des@hotmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nestorm_des@hotmail.com

--- Comment #65 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---
Hi.
My distro is Gentoo testing. I also use four RAID1 partitions on two 1 TB WD
Black disks that have never failed.
These RAID1 partitions use mdadm with metadata 0.90.

$ lsblk /dev/md*
RM RO MODEL NAME LABEL      FSTYPE MOUNTPOINT   SIZE PHY-SEC LOG-SEC MODE      
 0  0       md0  GentooBoot ext4                128M     512     512 brw-rw----
 0  0       md1  GentooSwap swap   [SWAP]         4G     512     512 brw-rw----
 0  0       md2  GentooRaiz ext4   /             50G     512     512 brw-rw----
 0  0       md3  GentooHome ext4   /home      877,4G     512     512 brw-rw----

Starting with 4.19.0 I began to have problems at boot. The system always
shuts down cleanly and unmounts all the partitions, but on the next boot I
land in fsck and end up in the recovery console. If I ignore these errors,
restart, and choose kernel 4.18.20, it does not drop into fsck and does not
detect any errors on the ext4 partitions.
Sometimes these errors trigger a resynchronization of the partition on which
fsck detects the false positives; I see it with $ cat /proc/mdstat
For now I will keep using 4.18.20. I have had these failures with 4.19.0,
4.19.1, 4.19.2, 4.19.3, 4.19.4 and 4.19.5, so this appears to be something in
the 4.19.x branch.

$ uname -a
Linux pc-user 4.18.20-gentoo #1 SMP PREEMPT Sat Nov 24 14:39:41

$ eselect kernel list 
Available kernel symlink targets:
  [1]   linux-4.18.20-gentoo
  [2]   linux-4.19.4-gentoo
  [3]   linux-4.19.5-gentoo *

Regards


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (65 preceding siblings ...)
  2018-11-28 16:26 ` bugzilla-daemon
@ 2018-11-28 17:28 ` bugzilla-daemon
  2018-11-28 20:42 ` bugzilla-daemon
                   ` (203 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28 17:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #66 from Jimmy.Jazz@gmx.net ---
(In reply to Jens Axboe from comment #64)

> How certain are you that my 4.19 patch fixes the issue completely for you?

Without your patch the failure was mostly systematic over time.
Synchronization mechanisms are not trivial anyway. But statistically there is
hope.

> If 100%, can you also try with 4.19.4 + just the first hunk of that patch?

> In other words, only apply the part to block/blk-core.c, not the one to
> block/blk-mq.c

I understand, probably synchronize_rcu was a bit too much :). Give me 24h
please.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (66 preceding siblings ...)
  2018-11-28 17:28 ` bugzilla-daemon
@ 2018-11-28 20:42 ` bugzilla-daemon
  2018-11-28 22:47 ` bugzilla-daemon
                   ` (202 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28 20:42 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Laurent Bonnaud (L.Bonnaud@laposte.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |L.Bonnaud@laposte.net

--- Comment #67 from Laurent Bonnaud (L.Bonnaud@laposte.net) ---
I am also experiencing ext4 corruptions with 4.19.x kernels.

One way to trigger this bug that works almost every time on my system is to
backup the whole FS with BorgBackup using this command:

nice -19 ionice -c3 borg create -v --stats --list --filter=AME
--one-file-system --exclude-caches --compression zstd --progress
my-server:/borg-backup::'{hostname}-{now:%Y-%m-%d_%H:%M}' /

Here are kernel messages:

[  916.082499] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318098:
comm borg: bad extra_isize 35466 (inode size 256)
[  916.093908] Aborting journal on device sda1-8.
[  916.096417] EXT4-fs (sda1): Remounting filesystem read-only
[  916.096799] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318101:
comm borg: bad extra_isize 35466 (inode size 256)
[  916.101544] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318103:
comm borg: bad extra_isize 35466 (inode size 256)
[  916.106531] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318107:
comm borg: bad extra_isize 35466 (inode size 256)
[  916.111039] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318110:
comm borg: bad extra_isize 35466 (inode size 256)
[  916.115763] EXT4-fs error (device sda1): ext4_iget:4831: inode #6318112:
comm borg: bad extra_isize 35466 (inode size 256)

If there is some interest, I can provide more details, but in another bug
report since this one is already loaded with attached files.
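
For collecting the affected inodes from a log like the one above, a rough
sketch follows. The function name ext4_bad_inodes is hypothetical, and it
assumes each kernel message is on a single line (as in real dmesg output, not
wrapped as in this mail).

```shell
# Hypothetical helper: read kernel log lines on stdin and print one
# "device inode" pair per distinct inode named in an "EXT4-fs error" message.
ext4_bad_inodes() {
    sed -nE 's/.*EXT4-fs error \(device ([^)]*)\).*inode #([0-9]+).*/\1 \2/p' |
        sort -u
}

# Usage:
#   dmesg | ext4_bad_inodes
```

Each inode number can then be mapped back to a path with debugfs, e.g.
"debugfs -R 'ncheck 6318098' /dev/sda1", ideally on an unmounted or read-only
file system.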


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (67 preceding siblings ...)
  2018-11-28 20:42 ` bugzilla-daemon
@ 2018-11-28 22:47 ` bugzilla-daemon
  2018-11-29  3:20 ` bugzilla-daemon
                   ` (201 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-28 22:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #68 from Jimmy.Jazz@gmx.net ---
# uname -a
Linux seal 4.18.0-rc1-radeon-00048-ge1333462e3-dirty #36 SMP PREEMPT Wed Nov 28
18:30:01 CET 2018 x86_64 AMD A10-5800K APU with Radeon(tm) HD Graphics
AuthenticAMD GNU/Linux

I am finally running a promising kernel that compiles, doesn't crash and cares
about my filesystems.
4.18.0-rc1-radeon-00048-ge1333462e3-dirty could be a winner.

This time the backup file system dm-4 could be efficiently cured and dirvish
has done its work as expected.
With it I could compile the kernel 4.19.4 as J.Axboe asked me to.

@T.Tso, if you still have an interest in dmesg, fsck (quite impressive) output
with that kernel version, let me know.
Actually, the interaction between e2fsck 1.44.4 and kernel 4.18 differs from
4.19.

A dmesg excerpt:
[12421.017028] EXT4-fs warning (device dm-4): kmmpd:191: kmmpd being stopped
since filesystem has been remounted as readonly.
[12434.457445] EXT4-fs warning (device dm-4): ext4_multi_mount_protect:325: MMP
interval 42 higher than expected, please wait.
The warning didn't show up with kernel 4.19, and remount is slower. No ext4
errors to be seen.

# git bisect log
git bisect start
# good: [94710cac0ef4ee177a63b5227664b38c95bbf703] Linux 4.18
git bisect good 94710cac0ef4ee177a63b5227664b38c95bbf703
# bad: [9ff01193a20d391e8dbce4403dd5ef87c7eaaca6] Linux 4.20-rc3
git bisect bad 9ff01193a20d391e8dbce4403dd5ef87c7eaaca6
# bad: [9ff01193a20d391e8dbce4403dd5ef87c7eaaca6] Linux 4.20-rc3
git bisect bad 9ff01193a20d391e8dbce4403dd5ef87c7eaaca6
# bad: [84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d] Linux 4.19
git bisect bad 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d
# bad: [f48097d294d6f76a38bf1a1cb579aa99ede44297] dt-bindings: display:
renesas: du: Document r8a77990 bindings
git bisect bad f48097d294d6f76a38bf1a1cb579aa99ede44297
# bad: [f48097d294d6f76a38bf1a1cb579aa99ede44297] dt-bindings: display:
renesas: du: Document r8a77990 bindings
git bisect bad f48097d294d6f76a38bf1a1cb579aa99ede44297
# bad: [54dbe75bbf1e189982516de179147208e90b5e45] Merge tag
'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm
git bisect bad 54dbe75bbf1e189982516de179147208e90b5e45
# bad: [0a957467c5fd46142bc9c52758ffc552d4c5e2f7] x86: i8259: Add missing
include file
git bisect bad 0a957467c5fd46142bc9c52758ffc552d4c5e2f7
# bad: [958f338e96f874a0d29442396d6adf9c1e17aa2d] Merge branch 'l1tf-final' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 958f338e96f874a0d29442396d6adf9c1e17aa2d
# bad: [85a0b791bc17f7a49280b33e2905d109c062a47b] Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect bad 85a0b791bc17f7a49280b33e2905d109c062a47b
# bad: [8603596a327c978534f5c45db135e6c36b4b1425] Merge branch
'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 8603596a327c978534f5c45db135e6c36b4b1425
# bad: [2406fb8d94fb17fee3ace0c09427c08825eacb16] Merge branch
'sched-urgent-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 2406fb8d94fb17fee3ace0c09427c08825eacb16
# bad: [cd23ac8ddb7be993f88bee893b89a8b4971c3651] rcu: Add comment to the last
sleep in the rcu tasks loop
git bisect bad cd23ac8ddb7be993f88bee893b89a8b4971c3651


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (68 preceding siblings ...)
  2018-11-28 22:47 ` bugzilla-daemon
@ 2018-11-29  3:20 ` bugzilla-daemon
  2018-11-29  4:48 ` bugzilla-daemon
                   ` (200 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-29  3:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #69 from Theodore Tso (tytso@mit.edu) ---
Hi Jimmy,  how certain are you that e1333462e3 is stable for you?    i.e., how
long have you been running with that kernel and how quickly do your other git
bisect bad builds fail for you?

And I assume you have run a forced fsck (ideally while 4.18 is booted) on the
file system before installing each kernel that you were bisect testing, right? 
  Otherwise it's possible that a previous bad kernel had left the file system
corrupted, and so a particular kernel stumbled on a corruption, but it wasn't
actually *caused* by that kernel.

The reason why I'm asking these questions is that based on your bisect, it would
*appear* that the problem was introduced by an RCU change.  If you look at the
output of "git log --oneline e1333462e3..cd23ac8ddb7" all of the changes are
RCU related.   That's a bit surprising, given that only some users are seeing
this problem.  If a regression had been introduced in the RCU subsystem, I
would have expected a large number of people to be complaining, with many more
bugs than just in ext4.

And there is some evidence that your file system has gotten corrupted.  The
warnings you report here:

[12421.017028] EXT4-fs warning (device dm-4): kmmpd:191: kmmpd being stopped
since filesystem has been remounted as readonly.
[12434.457445] EXT4-fs warning (device dm-4): ext4_multi_mount_protect:325: MMP 
interval 42 higher than expected, please wait.

Are caused by the MMP feature being enabled on your kernel.  It's not enabled
by default, and unless you have relatively exotic hardware (e.g., dual-attached
SCSI disks that can be reached by two servers for failover) there is no reason
to turn on the MMP feature.    You can disable it via:  "tune2fs -O ^mmp
/dev/dm-4".   (And you can enable it via "tune2fs -O mmp /dev/dm-4".)    So
apparently while you were running your tests, the superblock had at least one
bit (the MMP feature bit) flipped by a rogue kernel.
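
To see whether a feature bit such as mmp is set before and after booting a
test kernel, the feature list printed by "dumpe2fs -h" can be checked. The
helper below is only a sketch; the name has_ext4_feature is made up, and it
assumes the usual "Filesystem features:" line in dumpe2fs output.

```shell
# Hypothetical helper: read "dumpe2fs -h" output on stdin and succeed
# (exit status 0) if the named feature appears on the "Filesystem features:"
# line, fail (exit status 1) otherwise.
has_ext4_feature() {
    awk -v want="$1" '
        /^Filesystem features:/ {
            # Fields 1 and 2 are "Filesystem" and "features:".
            for (i = 3; i <= NF; i++)
                if ($i == want)
                    found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (needs read access to the device):
#   dumpe2fs -h /dev/dm-4 2>/dev/null | has_ext4_feature mmp && echo "mmp set"
```

Recording this result together with a forced fsck ("e2fsck -f") before each
bisect step would make it easier to tell which kernel changed the superblock.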


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (69 preceding siblings ...)
  2018-11-29  3:20 ` bugzilla-daemon
@ 2018-11-29  4:48 ` bugzilla-daemon
  2018-11-29 11:12 ` bugzilla-daemon
                   ` (199 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-29  4:48 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #70 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---
(In reply to Theodore Tso from comment #59)
> 
> But just to rule things out, I've uploaded the contents of fs/ext4 from
> 4.18.   I've verified it can be transplanted on top of 4.19 kernel.   Could
> the people who are experiencing problems with 4.19 try building a kernel
> with the 4.18 fs/ext4 directory?   If you still see problems, then the
> problem has to be elsewhere.   If you don't, then we can take a closer look
> at the ext4 changes (although I'd then be really puzzled why it's only
> showing up for some folks, but not others).
>

I copied fs/ext4 from the 4.18.20 tree to the 4.19.5 tree and compiled the
4.19.5 tree from scratch.
Well, now we'll have to wait and cross our fingers every time I restart the PC.
So far I have had no problems; if they appear I will post again with data.
As for CONFIG_SCSI_MQ_DEFAULT, it has been enabled in my configuration for
eons.

# cat /boot/config-4.18.20-gentoo |grep CONFIG_SCSI_MQ_DEFAULT=
CONFIG_SCSI_MQ_DEFAULT=y
# cat /boot/config-4.19.4-gentoo |grep CONFIG_SCSI_MQ_DEFAULT=
CONFIG_SCSI_MQ_DEFAULT=y
# cat /boot/config-4.19.5-gentoo |grep CONFIG_SCSI_MQ_DEFAULT=
CONFIG_SCSI_MQ_DEFAULT=y

# eix -Ic e2fsprogs
[I] sys-fs/e2fsprogs (1.44.4@07/11/18): Standard EXT2/EXT3/EXT4 filesystem
utilities
[I] sys-libs/e2fsprogs-libs (1.44.4@06/11/18): e2fsprogs libraries (common
error and subsystem)
Found 2 matches

Regards


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (70 preceding siblings ...)
  2018-11-29  4:48 ` bugzilla-daemon
@ 2018-11-29 11:12 ` bugzilla-daemon
  2018-11-29 16:32 ` bugzilla-daemon
                   ` (198 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-29 11:12 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Ortwin Glück (odi@odi.ch) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |odi@odi.ch

--- Comment #71 from Ortwin Glück (odi@odi.ch) ---
If it helps, I do NOT see this bug and I've run all 4.18.y and 4.19.y kernels:
CONFIG_SCSI_MQ_DEFAULT=y
CONFIG_MQ_IOSCHED_DEADLINE=y

rootfs on RAID-0 on 2 SSDs:
cat /proc/mdstat 
Personalities : [raid0] 
md127 : active raid0 sdb1[1] sda3[0]
      499341824 blocks super 1.2 256k chunks

/dev/md127 on / type ext4 (rw,noatime,discard,stripe=128)


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (71 preceding siblings ...)
  2018-11-29 11:12 ` bugzilla-daemon
@ 2018-11-29 16:32 ` bugzilla-daemon
  2018-11-29 16:34 ` bugzilla-daemon
                   ` (197 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-29 16:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #72 from Jimmy.Jazz@gmx.net ---
(In reply to Theodore Tso from comment #69)
I didn't trust the kernel enough to let it work all night without close
observation (i.e. I need some rest).

Compared with the latest tests, I feel certain the kernel is good after one day
of compilations running in parallel. That's why I postponed J.Axboe's request.

Actually, I'm working with 4.18 e1333462e3 and after three clean reboots, the
disks stayed clean.

Dirvish is running today and nothing bad has happened. I can say 4.18
e1333462e3 is good.
$ uptime
 17:12:44 up  3:23,  6 users,  load average: 10,54, 10,99, 10,13

Also, I didn't change my .config except when asked during the current commit. 

> how quickly do your other git bisect bad build fail ?

The builds failed after I put the kernel under load or when I backed up the
system (dirvish/rsync). When the activity was low I didn't observe anything
suspicious.

Also, the server is not a stupid idle beagle.

To summarize,

- I jumped to 4.19 because there was no improvement with 4.20-rc3... and I
feared for my data.
- From f48097d2 to 54dbe75b radeon module didn't work (i.e no display)
- 0a957467c5 crashed. Next try, crashed immediately during the boot. (comment
55)
- 958f338e I missed 'l1tf' patch (comment 62)
- From 958f338e to cd23ac8d I missed 'vdso' patch (comment 62)
- e1333462e3 I applied both patches 'l1tf' and 'vdso'

With commit e1333462e3, dm-4 partition could be cleaned efficiently (see
attachment).

> And I assume you have run a forced fsck
I ran fsck /dev/dm-XX with 4.18 commit e1333462e3, first in rescue mode, then
from the init script during normal boot. Unlike with 4.19 and later releases,
it was not necessary to force an fsck.

> a previous bad kernel had left the file system corrupted
I thought about it too (comment 62, second paragraph). In that case, why is
only 4.18 + e2fsprogs able to clean the partitions, and not more recent
kernels? Isn't e2fsprogs compatible with the 4.19 branch?

> git log --oneline e1333462e3..cd23ac8ddb7
I'm using gcc (Gentoo 8.2.0-r4 p1.5) 8.2.0 and use LD=ld.bfd. My linker is gold
by default. Sadly, I didn't find a way to compile it with clang.

> I would have expected a large number of people.
I understand. But race conditions are not always trivial.

> your file system has gotten corrupted.
dm-4 is marked read-only until a backup is performed. I (temporarily) added mmp
to the file systems because I thought I had a multi-remount issue at first.
The report was intended to draw your attention to the following: remount,rw
or remount,ro are really slow with 4.18 commit e1333462e3, and the warning has
never appeared in that way on other builds. That was not observed with vanilla
4.18.X.

Please, I didn't intend to mislead you. Just consider the warning a false
positive. If the warning points to a rogue kernel, then it is kernel 4.18 (a
contradiction).

My computers are on a UPS and I run fsck on every reboot, but force it again
only when an error has been detected. Anyway, corruption that appears and
disappears all of a sudden on the majority of file systems with such frequency
is quite remarkable.

The file systems are now clean over reboots. I propose to test if 4.19.5 kernel
stops showing corruptions. If they stop, it still opens a new question, why was
fsck missing some file system corruptions ?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (72 preceding siblings ...)
  2018-11-29 16:32 ` bugzilla-daemon
@ 2018-11-29 16:34 ` bugzilla-daemon
  2018-11-29 22:38 ` bugzilla-daemon
                   ` (196 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-29 16:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #73 from Jimmy.Jazz@gmx.net ---
Created attachment 279739
  --> https://bugzilla.kernel.org/attachment.cgi?id=279739&action=edit
fsck output kernel 4.18

the fsck has been done in rescue mode w/ 4.18


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (73 preceding siblings ...)
  2018-11-29 16:34 ` bugzilla-daemon
@ 2018-11-29 22:38 ` bugzilla-daemon
  2018-11-29 22:52 ` bugzilla-daemon
                   ` (195 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-29 22:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Michael Orlitzky (michael@orlitzky.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |michael@orlitzky.com

--- Comment #74 from Michael Orlitzky (michael@orlitzky.com) ---
(In reply to Laurent Bonnaud from comment #67)
> I am also experiencing ext4 corruptions with 4.19.x kernels.
> 
> One way to trigger this bug that works almost every time on my system is to
> backup the whole FS with BorgBackup using this command:
> 

Ouch, me too. I've already been through two hard drives and a new SATA
controller. I was just about to resign myself to replacing the whole PC.

My system is an older AMD Phenom, with absolutely nothing fancy going on.
Boring spinning disks, no RAID, and exactly the symptom above.

After upgrading to 4.19.0 everything was fine for a week, and then Borg started
reporting these errors. If I boot to a rescue CD and fsck, things go back to
"normal," but then after a few more days I get corruption again. IIRC I skipped
4.19.1 but had the same problem with 4.19.2, and now again on 4.19.3.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (74 preceding siblings ...)
  2018-11-29 22:38 ` bugzilla-daemon
@ 2018-11-29 22:52 ` bugzilla-daemon
  2018-11-30  1:06 ` bugzilla-daemon
                   ` (194 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-29 22:52 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #75 from Jimmy.Jazz@gmx.net ---
(In reply to Jens Axboe from comment #64)
> only apply the part to block/blk-core.c

@T.Tso and J.Axboe
e1333462e3 was not able to compile the 4.19.5 kernel.

Long story: gcc began to complain about a missing elfutils package (it was
already installed). I also ran into an old CONFIG_UNWINDER_ORC bug, "Cannot
generate ORC metadata". Compilations began to fail with a "cannot make
executable" error. Unbelievable as it is, the bug was reported recently
(https://lkml.org/lkml/2018/11/5/108).
I'm using dev-libs/elfutils-0.175 and the kernel isn't affected by
https://bugs.gentoo.org/671760

The good news: a 4.19.4 kernel with only the 'block/blk-core.c' part of your
patch has compiled 4.19.5. It doesn't show any sign of ext4 corruption.

I'm waiting for the next backup tomorrow.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (75 preceding siblings ...)
  2018-11-29 22:52 ` bugzilla-daemon
@ 2018-11-30  1:06 ` bugzilla-daemon
  2018-11-30  1:15 ` bugzilla-daemon
                   ` (193 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  1:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #76 from Jimmy.Jazz@gmx.net ---
(In reply to Jimmy.Jazz from comment #75)

> I'm waiting for the next backup tomorrow.

@J.Axboe
No need to wait. ext4 error resurfaced on dm-8 this time. block/blk-core.c
patch doesn't correct the issue.

[ 3774.584797] EXT4-fs error (device dm-8): ext4_iget:4985: inode #1614666:
comm emerge: bad extended attribute block 1


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (76 preceding siblings ...)
  2018-11-30  1:06 ` bugzilla-daemon
@ 2018-11-30  1:15 ` bugzilla-daemon
  2018-11-30  4:10 ` bugzilla-daemon
                   ` (192 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  1:15 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #77 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Jimmy.Jazz from comment #76)
> (In reply to Jimmy.Jazz from comment #75)
> 
> > I'm waiting for the next backup tomorrow.
> 
> @J.Axboe
> No need to wait. ext4 error resurfaced on dm-8 this time. block/blk-core.c
> patch doesn't correct the issue.
> 
> [ 3774.584797] EXT4-fs error (device dm-8): ext4_iget:4985: inode #1614666:
> comm emerge: bad extended attribute block 1

Are you still confident the full patch works? It's interesting since that has
RCU, and the other changes point in that direction, too.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (77 preceding siblings ...)
  2018-11-30  1:15 ` bugzilla-daemon
@ 2018-11-30  4:10 ` bugzilla-daemon
  2018-11-30  5:01 ` bugzilla-daemon
                   ` (191 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  4:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #78 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---
Well, guys, this seems to pass the test: after several reboots, each after
several hours of use, I had no more corruption on my four mdadm raid1
partitions. The solution was to delete the ext4 folder from my 4.19.5 kernel
tree, copy in the ext4 folder from my previous 4.18.20 kernel, and recompile
the 4.19.5 tree.
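
The folder swap could be scripted roughly as follows. This is only a sketch:
swap_ext4_dir and the tree paths are hypothetical names, it assumes two
unpacked source trees, and nothing guarantees that an older fs/ext4 still
builds against a newer tree.

```shell
# Replace <new_tree>/fs/ext4 with a copy of <old_tree>/fs/ext4.
# $1: old (known-good) kernel source tree, $2: new kernel source tree.
# Both paths are placeholders chosen for this sketch.
swap_ext4_dir() {
    old_tree=$1
    new_tree=$2
    # Refuse to touch anything unless both trees look plausible.
    [ -d "$old_tree/fs/ext4" ] && [ -d "$new_tree/fs" ] || return 1
    rm -rf "$new_tree/fs/ext4"
    cp -Rp "$old_tree/fs/ext4" "$new_tree/fs/ext4"
}
```

For example, `swap_ext4_dir linux-4.18.20 linux-4.19.5`, followed by the usual
rebuild of the 4.19.5 tree.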

$ uname -a
Linux pc-user 4.19.5-gentoo #1 SMP PREEMPT Thu Nov 29 00:45:31 2018 x86_64 AMD
FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

I compared the fs/ext4 folders of the 4.18.20 and 4.19.5 trees with meld and
there are several modifications; unfortunately, they exceed my knowledge.
At least on my side I can affirm that the problem is gone. We will see what
happens with later patches.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (78 preceding siblings ...)
  2018-11-30  4:10 ` bugzilla-daemon
@ 2018-11-30  5:01 ` bugzilla-daemon
  2018-11-30  7:18 ` bugzilla-daemon
                   ` (190 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  5:01 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #79 from Hao Wei Tee (angelsl@in04.sg) ---
(In reply to Néstor A. Marchesini from comment #78)
> Well, guys, this seems to pass the test: after several reboots, each after
> several hours of use, I had no more corruption on my four mdadm raid1
> partitions. The solution was to delete the ext4 folder from my 4.19.5
> kernel tree, copy in the ext4 folder from my previous 4.18.20 kernel, and
> recompile the 4.19.5 tree.
> 
> $ uname -a
> Linux pc-user 4.19.5-gentoo #1 SMP PREEMPT Thu Nov 29 00:45:31 2018 x86_64
> AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux
> 
> I compared the fs/ext4 folders of the 4.18.20 and 4.19.5 trees with meld
> and there are several modifications; unfortunately, they exceed my
> knowledge.
> At least on my side I can affirm that the problem is gone. We will see what
> happens with later patches.

If this is the case, perhaps you could bisect fs/ext4 between tags v4.18 and
v4.19?

$ git bisect start v4.19 v4.18 -- fs/ext4
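
A minimal sketch of how such a path-limited bisect might be driven, assuming
a clone of the mainline tree. The three helper names are made up for
illustration; the git subcommands themselves are standard.

```shell
# Begin bisecting, considering only commits that touch fs/ext4.
# Argument order for "git bisect start" is <bad> first, then <good>.
start_ext4_bisect() {
    git bisect start v4.19 v4.18 -- fs/ext4
}

# Record the verdict for the commit git has currently checked out.
# $1 is "clean" (no corruption seen) or "corrupt".
mark_result() {
    case "$1" in
        clean)   git bisect good ;;
        corrupt) git bisect bad ;;
        *)       echo "usage: mark_result clean|corrupt" >&2; return 2 ;;
    esac
}

# Leave bisect mode and return to the original checkout.
finish_bisect() {
    git bisect reset
}
```

Between start_ext4_bisect and finish_bisect, each round is: build and boot
the checked-out commit, exercise the file systems, then call mark_result.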

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (79 preceding siblings ...)
  2018-11-30  5:01 ` bugzilla-daemon
@ 2018-11-30  7:18 ` bugzilla-daemon
  2018-11-30  7:32 ` bugzilla-daemon
                   ` (189 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  7:18 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #80 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Néstor A. Marchesini from comment #78)
> Well, guys, this seems to pass the test: after several reboots, each after
> several hours of use, I had no more corruption on my four mdadm raid1
> partitions. The solution was to delete the ext4 folder from my 4.19.5
> kernel tree, copy in the ext4 folder from my previous 4.18.20 kernel, and
> recompile the 4.19.5 tree.
> 
> $ uname -a
> Linux pc-user 4.19.5-gentoo #1 SMP PREEMPT Thu Nov 29 00:45:31 2018 x86_64
> AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux
> 
> I compared the fs/ext4 folders of the 4.18.20 and 4.19.5 trees with meld
> and there are several modifications; unfortunately, they exceed my
> knowledge.
> At least on my side I can affirm that the problem is gone. We will see what
> happens with later patches.

If you can bisect it as suggested in comment 79, please mind what Ted Tso has
said in comment 69, para. 2.

So, after you have hit a bad kernel, make sure that your fs is OK and do the
next step (compiling) with a known-good kernel (4.18.20). Otherwise you
might wrongly mark a good commit as bad.
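
The check between bisect steps could look roughly like this. It is a sketch
under assumptions: the function name is invented, the device argument is a
placeholder, and the file system is assumed to be unmounted while it is
checked. e2fsck's -f forces a full check and -n answers "no" to every
question, so nothing is modified; a non-zero exit status means the check was
not clean.

```shell
# Report whether the file system on device $1 is clean enough to go on.
check_before_next_step() {
    if e2fsck -f -n "$1" >/dev/null; then
        echo "clean: ok to build the next bisect step"
    else
        echo "not clean: repair from a known-good kernel before continuing"
        return 1
    fi
}
```

The point of returning non-zero is that a wrapper script can stop instead of
compiling the next kernel on a file system that is already damaged.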

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (80 preceding siblings ...)
  2018-11-30  7:18 ` bugzilla-daemon
@ 2018-11-30  7:32 ` bugzilla-daemon
  2018-11-30  7:51 ` bugzilla-daemon
                   ` (188 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  7:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #81 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Jens Axboe from comment #77)
> (In reply to Jimmy.Jazz from comment #76)
> > (In reply to Jimmy.Jazz from comment #75)
> > 
> > > I'm waiting for the next backup tomorrow.
> > 
> > @J.Axboe
> > No need to wait. ext4 error resurfaced on dm-8 this time. block/blk-core.c
> > patch doesn't correct the issue.
> > 
> > [ 3774.584797] EXT4-fs error (device dm-8): ext4_iget:4985: inode #1614666:
> > comm emerge: bad extended attribute block 1
> 
> Are you still confident the full patch works? It's interesting since that
> has RCU, and the other changes point in that direction, too.

It looks like the problem may be caused by changes in fs/ext4 (see comment 78).

But I'm wondering why this only affects some (quite a few, though) and not all.
Like others, I'm running 4.19.5 without any problem here, it's just nice.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (81 preceding siblings ...)
  2018-11-30  7:32 ` bugzilla-daemon
@ 2018-11-30  7:51 ` bugzilla-daemon
  2018-11-30  8:43 ` bugzilla-daemon
                   ` (187 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  7:51 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #82 from Hao Wei Tee (angelsl@in04.sg) ---
I fear the bug might be caused by some interaction between something new in
fs/ext4 and something new elsewhere... Sounds unlikely, but it's possible.

Since 4.18 ext4 seems to work on 4.19 kernel, maybe it's worth trying 4.19 ext4
on 4.18 kernel (before a bisect), just to make sure the bisect won't lead us
down a false trail?

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (82 preceding siblings ...)
  2018-11-30  7:51 ` bugzilla-daemon
@ 2018-11-30  8:43 ` bugzilla-daemon
  2018-11-30 10:37 ` bugzilla-daemon
                   ` (186 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30  8:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #83 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Hao Wei Tee from comment #82)
> I fear the bug might be caused by some interaction between something new in
> fs/ext4 and something new elsewhere... Sounds unlikely, but it's possible.
> 
> Since 4.18 ext4 seems to work on 4.19 kernel, maybe it's worth trying 4.19
> ext4 on 4.18 kernel (before a bisect), just to make sure the bisect won't
> lead us down a false trail?

Interesting idea. But what works in one direction might not necessarily work
the other way round. Personally, I'd rather be on the safe side here.
So before doing this it might be wise to hear what Ted Tso thinks about it,
just IMO.

And I don't think that bisecting just fs/ext4 would be misleading. If we find a
bad commit there, Ted and others will look at it anyway and will see whether
this alone explains the problems or whether an interaction with something else
would be necessary to make sense of it.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (83 preceding siblings ...)
  2018-11-30  8:43 ` bugzilla-daemon
@ 2018-11-30 10:37 ` bugzilla-daemon
  2018-11-30 11:09 ` bugzilla-daemon
                   ` (185 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 10:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Andreas John (himself@derjohn.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |himself@derjohn.de

--- Comment #84 from Andreas John (himself@derjohn.de) ---
Hi,
thanks for investigating the issue. It "cost" me some inodes on my ext4 rootfs
(rMBP, SSD, dm-crypt disk). It appeared on 4.19.1 in my case.

I just wanted to add that I run btrfs on dm-crypt on the same SSD for /home,
and that one is not affected by the issue as far as I can tell.

rgds,
j

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (84 preceding siblings ...)
  2018-11-30 10:37 ` bugzilla-daemon
@ 2018-11-30 11:09 ` bugzilla-daemon
  2018-11-30 12:10 ` bugzilla-daemon
                   ` (184 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 11:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #85 from Hao Wei Tee (angelsl@in04.sg) ---
(In reply to Rainer Fiebig from comment #83)
> Interesting idea. But what works in one direction might not necessarily work
> the other way round.

Exactly my point. We know that (4.19 ext4 and kernel is broken), (4.18 ext4 and
kernel is working), and (4.18 ext4 and 4.19 kernel is working).

If (4.19 ext4 and 4.18 kernel) is broken, then _most likely_ the bug is caused
by something that changed in v4.18..v4.19. If (4.19 ext4 and 4.18 kernel)
*works*, then either the bug is in something else that changed, or there is an
interaction between two changes that happened in v4.18..v4.19.

In any case, bisecting v4.18..v4.19 will probably give us a clue.
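
The reasoning above boils down to a two-way decision on the one combination
nobody has tried yet (4.19 fs/ext4 built into a 4.18 kernel). A small sketch,
with an invented function name, and reading "something that changed" in the
broken case as the fs/ext4 changes, which is an interpretation:

```shell
# Interpret the outcome of testing 4.19 fs/ext4 inside a 4.18 kernel.
# $1 is "broken" (corruption seen) or "works" (no corruption seen).
interpret_swap_test() {
    case "$1" in
        broken) echo "most likely caused by the fs/ext4 changes" ;;
        works)  echo "caused by something else, or by an interaction of two changes" ;;
        *)      echo "unknown result: $1" >&2; return 2 ;;
    esac
}
```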

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (85 preceding siblings ...)
  2018-11-30 11:09 ` bugzilla-daemon
@ 2018-11-30 12:10 ` bugzilla-daemon
  2018-11-30 14:20 ` bugzilla-daemon
                   ` (183 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 12:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #86 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Hao Wei Tee from comment #85)
> (In reply to Rainer Fiebig from comment #83)
> > Interesting idea. But what works in one direction might not necessarily
> > work the other way round.
> 
> Exactly my point. We know that (4.19 ext4 and kernel is broken), (4.18 ext4
> and kernel is working), and (4.18 ext4 and 4.19 kernel is working).
> 
> If (4.19 ext4 and 4.18 kernel) is broken, then _most likely_ the bug is
> caused by something that changed in v4.18..v4.19. If (4.19 ext4 and 4.18
> kernel) *works*, then either the bug is in something else that changed, or
> there is an interaction between two changes that happened in v4.18..v4.19.
> 
Sure, I understand this. I would just shy away from recommending this to others
without a nod from higher powers. But of course it's up to Nestor whether he
wants to try this or not.

> In any case, bisecting v4.18..v4.19 will probably give us a clue.
Let's hope for the best. Perhaps bisecting fs/ext4 will provide enough of a
clue already and spare the poor bisecter from having to bisect the whole
beast. ;)

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (86 preceding siblings ...)
  2018-11-30 12:10 ` bugzilla-daemon
@ 2018-11-30 14:20 ` bugzilla-daemon
  2018-11-30 15:44 ` bugzilla-daemon
                   ` (182 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 14:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #87 from Artem S. Tashkinov (aros@gmx.com) ---
Regression testing could be carried out in a VM running on top of a ramdisk
(e.g. tmpfs) to speed up the process.

I guess someone with a decent amount of persistence and spare time could do
that and test each individual commit between 4.18 and 4.19, however that
doesn't guarantee success since the bug might be hardware related and not
reproducible in a virtual environment. Or it might require obscene amounts of
RAM/disk space which would be difficult, if not impossible to reproduce in a
VM.

I for one decided to stay on 4.18.x and not upgrade to any more recent kernels
until the regression is identified and dealt with.

Maybe one day someone will become truly invested in the kernel development
process and we'll have proper QA/QC/unit testing/regression testing/fuzzing,
so that individuals won't have to sacrifice their data and time because kernel
developers are mostly busy with adding new features and usually not really
concerned with performance, security and stability of their code unless they
are pointed at such issues.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (87 preceding siblings ...)
  2018-11-30 14:20 ` bugzilla-daemon
@ 2018-11-30 15:44 ` bugzilla-daemon
  2018-11-30 15:49 ` bugzilla-daemon
                   ` (181 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 15:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #88 from Jimmy.Jazz@gmx.net ---
(In reply to Jens Axboe from comment #77)

> Are you still confident the full patch works? It's interesting since that
> has RCU, and the other changes point in that direction, too.

4.19.4 with the full patch is stable. I'm just puzzled by its ability to clean
a failed file system with sys-fs/e2fsprogs-1.44.4.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (88 preceding siblings ...)
  2018-11-30 15:44 ` bugzilla-daemon
@ 2018-11-30 15:49 ` bugzilla-daemon
  2018-11-30 17:08 ` bugzilla-daemon
                   ` (180 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 15:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #89 from Jimmy.Jazz@gmx.net ---
To all,

Please attach a large number of read-write mounted file systems to your
test system. That will increase the probability of the failure.
My experience is, the issue doesn't affect a specific mountpoint over and over
but rather a random one.

FYI, I didn't have any issue with any of the tmpfs filesystems I have mounted.
You should take that into consideration when creating your VM test environment.

nilfs is stable.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (89 preceding siblings ...)
  2018-11-30 15:49 ` bugzilla-daemon
@ 2018-11-30 17:08 ` bugzilla-daemon
  2018-11-30 17:22 ` bugzilla-daemon
                   ` (179 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 17:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #90 from Hao Wei Tee (angelsl@in04.sg) ---
(In reply to Artem S. Tashkinov from comment #87)
> Maybe one day someone will become truly invested in the kernel development
> process and we'll have proper QA/QC/unit testing/regression
> testing/fuzzing

What we have now is not proper? syzkaller bot, Linux Test Project,
kernelci.org, xfstests, and more that I don't know of. Probably more than any
other OS.

I think it's fair to say Linux has by far more configuration options than any
other kernel out there. It's not feasible to test every single possible
combination. Things will slip through the cracks, especially bugs like this one
where clearly there is something wrong, but not everyone is able to reproduce
it at all. Automated tests are going to miss bugs like this.

We're not doing things worse than anyone else. Apple's APFS had major issues.
Microsoft just had a big problem with their Windows 10 1809 rollout.

Anyway, I remember you from another post you made to LKML complaining about
Linux. You really don't like the way Linux is developed. Why do you still use
it?

I digress.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (90 preceding siblings ...)
  2018-11-30 17:08 ` bugzilla-daemon
@ 2018-11-30 17:22 ` bugzilla-daemon
  2018-11-30 17:47 ` bugzilla-daemon
                   ` (178 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 17:22 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #91 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---
For me, the problem started with the release of 4.19.0, and looking at the
commits of the 4.19.0 tree, I see that many things in ext4 have been changed
... very many, I would say.
If you search for ext4 in the list of commits you will find several, with
very important changes.

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.19

There are several massive ones; one of the most important is:

https://github.com/torvalds/linux/commit/c140f8b072d16595c83d4d16a05693e72d9b1973

This weekend I will try with git bisect, but it will be a very time-consuming
task due to the large number of ext4 commits.
I'm still using 4.19.5 with the ext4 folder of 4.18.20. I have not had problems
so far.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (91 preceding siblings ...)
  2018-11-30 17:22 ` bugzilla-daemon
@ 2018-11-30 17:47 ` bugzilla-daemon
  2018-11-30 18:01 ` bugzilla-daemon
                   ` (177 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 17:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #92 from Hao Wei Tee (angelsl@in04.sg) ---
(In reply to Néstor A. Marchesini from comment #91)
> For me, the problem started with the release of 4.19.0, and looking at the
> commits of the 4.19.0 tree, I see that many things in ext4 have been changed
> ... very many, I would say.
> If you search for ext4 in the list of commits you will find several, with
> very important changes.

There are only 32 new commits in fs/ext4 in v4.19 from v4.18. See [1], count
until commit "ext4: fix check to prevent initializing reserved inodes".

[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/fs/ext4?id=v4.19
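
One way to check that count locally, assuming a clone of the mainline tree
with both tags present. The helper name is made up for illustration, and the
number printed depends on how merges are counted, so it may not come out as
exactly 32.

```shell
# Count non-merge commits between v4.18 and v4.19 that touch fs/ext4.
count_ext4_commits() {
    git log --oneline --no-merges v4.18..v4.19 -- fs/ext4 | wc -l
}
```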

> There are several massive ones; one of the most important is:
> 
>
> https://github.com/torvalds/linux/commit/c140f8b072d16595c83d4d16a05693e72d9b1973

This isn't in v4.19? It only got pulled in the v4.20 merge window.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (92 preceding siblings ...)
  2018-11-30 17:47 ` bugzilla-daemon
@ 2018-11-30 18:01 ` bugzilla-daemon
  2018-11-30 18:05 ` bugzilla-daemon
                   ` (176 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 18:01 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #93 from Guenter Roeck (linux@roeck-us.net) ---
Most of the ext4 patches in v4.19 have been backported to v4.18.y. Since
v4.18.20 is reported to be stable, it is quite likely that the problem lies
with one or more of the patches which have _not_ been backported. This would be
one of the following patches.

ext4: close race between direct IO and ext4_break_layouts()
ext4: add nonstring annotations to ext4.h
ext4: readpages() should submit IO as read-ahead
dax: remove VM_MIXEDMAP for fsdax and device dax
ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
ext4: improve code readability in ext4_iget()
ext4: handle layout changes to pinned DAX mappings
ext4: use swap macro in mext_page_double_lock
ext4: check allocation failure when duplicating "data" in ext4_remount()
ext4: fix warning message in ext4_enable_quotas()
ext4: super: extend timestamps to 40 bits
ext4: use timespec64 for all inode times
ext4: use ktime_get_real_seconds for i_dtime
ext4: use 64-bit timestamps for mmp_time
block: Define and use STAT_READ and STAT_WRITE
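
A list like this can be approximated by comparing commit subjects, since a
stable backport normally keeps the subject of the mainline commit. The sketch
below assumes a clone that has both the mainline tags and the v4.18.20 tag
from the stable tree; the function name is invented, and this is not
necessarily how the list above was produced.

```shell
# Print subjects of fs/ext4 commits in v4.18..v4.19 that have no commit
# with the same subject in v4.18..v4.18.20 (i.e. apparently not backported).
not_backported() {
    stable=$(mktemp) || return 1
    git log --no-merges --format=%s v4.18..v4.18.20 -- fs/ext4 >"$stable"
    # -F: fixed strings, -x: whole-line match, -v: keep the non-matching ones.
    git log --no-merges --format=%s v4.18..v4.19 -- fs/ext4 |
        grep -Fxv -f "$stable"
    rm -f "$stable"
}
```

Matching on subjects is a heuristic: a backport whose subject was reworded
would show up here as missing.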

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (93 preceding siblings ...)
  2018-11-30 18:01 ` bugzilla-daemon
@ 2018-11-30 18:05 ` bugzilla-daemon
  2018-11-30 18:07 ` bugzilla-daemon
                   ` (175 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 18:05 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #94 from Jens Axboe (axboe@kernel.dk) ---
In terms of ext4 changes, it'd be interesting to just revert this one:

commit ac22b46a0b65dbeccbf4d458db95687e825bde90
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Aug 17 15:45:42 2018 -0700

    ext4: readpages() should submit IO as read-ahead

as that guy is generally just not trustworthy. In all seriousness, though, it
shouldn't cause issues (or I would not have done it), and we already do this
for readpages in general, but I guess we could have an older bug in ext4 that
depends deeply on read-ahead NOT failing. Not sure how likely that is, Ted can
probably comment on that.

But it's a trivial revert, and it could potentially be implicated.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (94 preceding siblings ...)
  2018-11-30 18:05 ` bugzilla-daemon
@ 2018-11-30 18:07 ` bugzilla-daemon
  2018-11-30 18:40 ` bugzilla-daemon
                   ` (174 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 18:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #95 from Jens Axboe (axboe@kernel.dk) ---
BTW, if that patch is to blame, then the bug is elsewhere in ext4 as there
should be no way that read-ahead failing should cause corruption.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (95 preceding siblings ...)
  2018-11-30 18:07 ` bugzilla-daemon
@ 2018-11-30 18:40 ` bugzilla-daemon
  2018-11-30 18:45 ` bugzilla-daemon
                   ` (173 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 18:40 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #96 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Jens Axboe from comment #94)
> In terms of ext4 changes, it'd be interesting to just revert this one:
> 
> commit ac22b46a0b65dbeccbf4d458db95687e825bde90
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Fri Aug 17 15:45:42 2018 -0700
> 
>     ext4: readpages() should submit IO as read-ahead
> 
> as that guy is generally just not trustworthy. In all seriousness, though,
> it shouldn't cause issues (or I would not have done it), and we already do
> this for readpages in general, but I guess we could have an older bug in
> ext4 that depends deeply on read-ahead NOT failing. Not sure how likely that
> is, Ted can probably comment on that.
> 
> But it's a trivial revert, and it could potentially be implicated.

Jens, could you provide the patch here, so that perhaps Jimmy and Nestor can
revert it on their 4.19.x and tell us what they see? 
Thanks.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (96 preceding siblings ...)
  2018-11-30 18:40 ` bugzilla-daemon
@ 2018-11-30 18:45 ` bugzilla-daemon
  2018-11-30 18:54 ` bugzilla-daemon
                   ` (172 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 18:45 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #97 from Guenter Roeck (linux@roeck-us.net) ---
#94 makes me wonder if the problem may be related to
https://lkml.org/lkml/2018/5/21/71. Just wondering, and I may be completely off
track, but that problem is still seen against the mainline kernel.

#96: commit ac22b46a0b65 can be reverted cleanly with "git revert".

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (97 preceding siblings ...)
  2018-11-30 18:45 ` bugzilla-daemon
@ 2018-11-30 18:54 ` bugzilla-daemon
  2018-11-30 19:02 ` bugzilla-daemon
                   ` (171 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 18:54 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #98 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Guenter Roeck from comment #97)
> #94 makes me wonder if the problem may be related to
> https://lkml.org/lkml/2018/5/21/71. Just wondering, and I may be completely
> off track, but that problem is still seen against the mainline kernel.
> 
> #96: commit ac22b46a0b65 can be reverted cleanly with "git revert".

Yep, but perhaps some people here don't use git and/or haven't cloned the repo.
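
For a tarball tree without git, a commit can be backed out by applying its
patch in reverse. A sketch, assuming GNU patch (for --dry-run) and a patch
file for the commit that has already been saved locally; the function name
and both paths are placeholders.

```shell
# Reverse-apply a commit's patch to an unpacked kernel source tree.
# $1: top-level directory of the tree, $2: path to the saved patch file.
revert_patch() {
    # Dry run first, so a hunk that does not apply leaves the tree untouched.
    patch -d "$1" -p1 -R --dry-run <"$2" >/dev/null || return 1
    patch -d "$1" -p1 -R <"$2"
}
```

For example, `revert_patch linux-4.19.5 readahead.patch`, then rebuild.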

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (98 preceding siblings ...)
  2018-11-30 18:54 ` bugzilla-daemon
@ 2018-11-30 19:02 ` bugzilla-daemon
  2018-12-01  1:25 ` bugzilla-daemon
                   ` (170 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-11-30 19:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #99 from Guenter Roeck (linux@roeck-us.net) ---
#98: Good point.

I am going to give it a try with the following on top of v4.19.5:

Revert "ext4: handle layout changes to pinned DAX mappings"
Revert "dax: remove VM_MIXEDMAP for fsdax and device dax"
Revert "ext4: close race between direct IO and ext4_break_layouts()"
Revert "ext4: improve code readability in ext4_iget()"
Revert "ext4: readpages() should submit IO as read-ahead"

Wild shot, but I figured it may be worth a try.
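
For anyone who wants to repeat this with git, each revert can be looked up by
its subject line. This is a sketch, not the exact procedure used here: the
helper name is invented, it assumes a git tree based on v4.19.y with the
v4.18 and v4.19 tags available, and it does not guarantee that every revert
applies cleanly.

```shell
# Find the commit in v4.18..v4.19 whose message contains subject $1
# and revert it on the current branch.
revert_by_subject() {
    # --fixed-strings keeps "()" in subjects from being read as a regex.
    sha=$(git log -1 --format=%H --fixed-strings --grep="$1" v4.18..v4.19) || return 1
    [ -n "$sha" ] || { echo "no commit found for: $1" >&2; return 1; }
    git revert --no-edit "$sha"
}
```

For example, `revert_by_subject 'ext4: readpages() should submit IO as
read-ahead'`, repeated once for each of the five subjects.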

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (99 preceding siblings ...)
  2018-11-30 19:02 ` bugzilla-daemon
@ 2018-12-01  1:25 ` bugzilla-daemon
  2018-12-01  2:34 ` bugzilla-daemon
                   ` (169 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01  1:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #100 from Guenter Roeck (linux@roeck-us.net) ---
So far I have been unable to reproduce the problem after reverting the patches
mentioned in #99. I'll now install this kernel on a second previously affected
system. I'll report back tomorrow morning.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (100 preceding siblings ...)
  2018-12-01  1:25 ` bugzilla-daemon
@ 2018-12-01  2:34 ` bugzilla-daemon
  2018-12-01  3:43 ` bugzilla-daemon
                   ` (168 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01  2:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #101 from Theodore Tso (tytso@mit.edu) ---
Guenter, what is your kernel config?   A number of these changes are related to
CONFIG_DAX.    Are you building kernels with or without CONFIG_DAX enabled?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (101 preceding siblings ...)
  2018-12-01  2:34 ` bugzilla-daemon
@ 2018-12-01  3:43 ` bugzilla-daemon
  2018-12-01  4:00 ` bugzilla-daemon
                   ` (167 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01  3:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #102 from Guenter Roeck (linux@roeck-us.net) ---
Enabled:

$ grep DAX .config
CONFIG_NVDIMM_DAX=y
CONFIG_DAX_DRIVER=y
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (102 preceding siblings ...)
  2018-12-01  3:43 ` bugzilla-daemon
@ 2018-12-01  4:00 ` bugzilla-daemon
  2018-12-01  9:25 ` bugzilla-daemon
                   ` (166 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01  4:00 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #103 from Guenter Roeck (linux@roeck-us.net) ---
It doesn't look like dax is loaded, though. /dev/daxX does not exist on any of
the affected systems, and lsmod doesn't show any dax modules.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (103 preceding siblings ...)
  2018-12-01  4:00 ` bugzilla-daemon
@ 2018-12-01  9:25 ` bugzilla-daemon
  2018-12-01 12:57 ` bugzilla-daemon
                   ` (165 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01  9:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #104 from Rainer Fiebig (jrf@mailbox.org) ---
#101 Both 4.19.x kernels (VM/real HW):

> grep DAX .config
# CONFIG_DAX is not set
# CONFIG_FS_DAX is not set

Both kernels did *not* have the problem.

This may explain why some see the problem and others don't.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (104 preceding siblings ...)
  2018-12-01  9:25 ` bugzilla-daemon
@ 2018-12-01 12:57 ` bugzilla-daemon
  2018-12-01 14:20 ` bugzilla-daemon
                   ` (164 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 12:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #105 from Andreas John (himself@derjohn.de) ---
FYI: 4.19.6 was just released; there is a DAX fixup inside:

dax: Avoid losing wakeup in dax_lock_mapping_entry

Maybe the courageous testers should also consider that one, if the issue is DAX
related?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (105 preceding siblings ...)
  2018-12-01 12:57 ` bugzilla-daemon
@ 2018-12-01 14:20 ` bugzilla-daemon
  2018-12-01 14:28 ` bugzilla-daemon
                   ` (163 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 14:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #106 from Guenter Roeck (linux@roeck-us.net) ---
I have seen the problem again tonight, but I am not sure if I cleaned the
affected file system correctly with an older kernel before I started the test.
I'll keep running with the same reverts for another day.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (106 preceding siblings ...)
  2018-12-01 14:20 ` bugzilla-daemon
@ 2018-12-01 14:28 ` bugzilla-daemon
  2018-12-01 14:52 ` bugzilla-daemon
                   ` (162 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 14:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #107 from Guenter Roeck (linux@roeck-us.net) ---
#104: Possibly, but it doesn't explain why I see the problem only on two of
four systems, all running the same kernel.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (107 preceding siblings ...)
  2018-12-01 14:28 ` bugzilla-daemon
@ 2018-12-01 14:52 ` bugzilla-daemon
  2018-12-01 15:16 ` bugzilla-daemon
                   ` (161 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 14:52 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #108 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Guenter Roeck from comment #107)
> #104: Possibly, but it doesn't explain why I see the problem only on two of
> four systems, all running the same kernel.

Right. Bye bye DAX-theory.

I'm wondering by now whether I made a config-mistake somewhere and *that's* why
I don't have the problem. ;)


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (108 preceding siblings ...)
  2018-12-01 14:52 ` bugzilla-daemon
@ 2018-12-01 15:16 ` bugzilla-daemon
  2018-12-01 15:35 ` bugzilla-daemon
                   ` (160 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 15:16 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Marc Koschewski (marc@osknowledge.org) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marc@osknowledge.org

--- Comment #109 from Marc Koschewski (marc@osknowledge.org) ---
(In reply to Guenter Roeck from comment #107)
> #104: Possibly, but it doesn't explain why I see the problem only on two of
> four systems, all running the same kernel.

Could it be hardware related, e.g. like the blacklisted "trim" for the Samsung
850 Pro? Are the 4 machines absolutely equal hardware-wise (at least on the
block layer)? Maybe such a quirk is needed for just another device...

Running ext4 on 4.19.{5,4,3,2,1,0} with not one error with the following setup:

root@marc:~ # lsblk /dev/sda
NAME           MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda              8:0    0  477G  0 disk  
├─sda1           8:1    0    1G  0 part  /boot
├─sda2           8:2    0    2M  0 part  
├─sda3           8:3    0    2M  0 part  
├─sda4           8:4    0    2M  0 part  
├─sda5           8:5    0    1G  0 part  
├─sda6           8:6    0  408G  0 part  
│ └─crypt-home 254:1    0  408G  0 crypt /home
├─sda7           8:7    0   59G  0 part  /
└─sda8           8:8    0    8G  0 part  
  └─crypt-swap 254:0    0    8G  0 crypt 

root@marc:~ # mount | grep home
/dev/mapper/crypt-home on /home type ext4
(rw,nosuid,noatime,nodiratime,quota,usrquota,grpquota)

root@marc:~ # cryptsetup status crypt-home 
/dev/mapper/crypt-home is active and is in use.
  type:    LUKS1
  cipher:  aes-xts-plain64
  keysize: 512 bits
  key location: dm-crypt
  device:  /dev/sda6
  sector size:  512
  offset:  4096 sectors
  size:    855633920 sectors
  mode:    read/write

root@marc:~ # egrep -i "(ext4|dax)" /boot/config-4.19.5loc64 
CONFIG_DAX=y
# CONFIG_DEV_DAX is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_FS_DAX=y

root@marc:~ # parted --list 
Model: ATA Samsung SSD 860 (scsi)
Disk /dev/sda: 512GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name  Flags
 1      1049kB  1075MB  1074MB  ext4
 2      1075MB  1077MB  2097kB                     boot, esp
 3      1077MB  1079MB  2097kB
 4      1079MB  1081MB  2097kB
 5      1081MB  2155MB  1074MB  ext4
 6      2155MB  440GB   438GB
 7      440GB   504GB   63.4GB  ext4
 8      504GB   512GB   8518MB
...
...
...
Model: Linux device-mapper (crypt) (dm)
Disk /dev/mapper/crypt-home: 438GB
Sector size (logical/physical): 512B/512B
Partition Table: loop
Disk Flags: 

Number  Start   End    Size   File system  Flags
 1      0.00B   438GB  438GB  ext4


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (109 preceding siblings ...)
  2018-12-01 15:16 ` bugzilla-daemon
@ 2018-12-01 15:35 ` bugzilla-daemon
  2018-12-01 15:39 ` bugzilla-daemon
                   ` (159 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 15:35 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #110 from Bart Van Assche (bvanassche@acm.org) ---
(In reply to Marc Koschewski from comment #109)
> Could it be hardware related, e.g. like the blacklisted "trim" for the Samsung
> 850 Pro? Are the 4 machines absolutely equal hardware-wise (at least on the
> block layer)? Maybe such a quirk is needed for just another device...

Marc, are you using an I/O scheduler? I'm not using an I/O scheduler:

$ cat /sys/block/sda/queue/scheduler 
[none]


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (110 preceding siblings ...)
  2018-12-01 15:35 ` bugzilla-daemon
@ 2018-12-01 15:39 ` bugzilla-daemon
  2018-12-01 18:27 ` bugzilla-daemon
                   ` (158 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 15:39 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #111 from Marc Koschewski (marc@osknowledge.org) ---
(In reply to Bart Van Assche from comment #110)
> (In reply to Marc Koschewski from comment #109)
> > Could it be hardware related, e.g. like the blacklisted "trim" for the
> > Samsung 850 Pro? Are the 4 machines absolutely equal hardware-wise (at
> > least on the block layer)? Maybe such a quirk is needed for just another
> > device...
> 
> Marc, are you using an I/O scheduler? I'm not using an I/O scheduler:
> 
> $ cat /sys/block/sda/queue/scheduler 
> [none]

I do:

root@marc:~ # cat /sys/block/sda/queue/scheduler 
[mq-deadline] kyber bfq none

might be relevant as well:

root@marc:~ # cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.19.5loc64 root=/dev/sda7 ro init=/sbin/openrc-init
root=PARTUUID=6d19e60a-72a8-ee44-89f4-cc6f85a9436c real_root=/dev/sda7 ro
resume=PARTUUID=fbc25a25-2d09-634d-9e8b-67308f2feddf real_resume=/dev/sda8
acpi_osi=Linux libata.dma=3 libata.noacpi=0 threadirqs rootfstype=ext4
acpi_sleep=s3_bios,s3_beep devtmpfs.mount=0 net.ifnames=0 vmalloc=512M
noautogroup elevator=deadline libata.force=noncq nouveau.noaccel=0
nouveau.nofbaccel=1 nouveau.modeset=1 nouveau.runpm=0 nmi_watchdog=0
i915.modeset=0 cgroup_disable=memory scsi_mod.use_blk_mq=y dm_mod.use_blk_mq=y
vgacon.scrollback_persistent=1 processor.ignore_ppc=1 intel_iommu=off
crashkernel=128M apparmor=1 security=apparmor


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (111 preceding siblings ...)
  2018-12-01 15:39 ` bugzilla-daemon
@ 2018-12-01 18:27 ` bugzilla-daemon
  2018-12-01 19:49 ` bugzilla-daemon
                   ` (157 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 18:27 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #112 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---

$ zcat /proc/config.gz |grep DAX
CONFIG_DAX=m
# CONFIG_FS_DAX is not set

I have DAX built as a module, but I've never seen it loaded in lsmod.

$ cat /sys/block/sda/queue/scheduler
[none]

I always use the mount options barrier=1,data=ordered on my partitions:

$ cat /etc/fstab |grep LABEL
LABEL=GentooBoot  /boot    ext4   noatime,noauto,barrier=1,data=ordered    0 2
LABEL=GentooSwap  none     swap   swap  0 0
LABEL=GentooRaiz  /        ext4   noatime,barrier=1,data=ordered   0 1
LABEL=GentooHome  /home    ext4   noatime,barrier=1,data=ordered   0 2

Excellent point by Guenter Roeck in comment 93.
I would have to create several 4.19.5 trees and remove the suspect changes one
or two at a time to isolate the fault.
I am still running 4.19.5 with the ext4 folder from 4.18.20, with zero problems.

Regards


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (112 preceding siblings ...)
  2018-12-01 18:27 ` bugzilla-daemon
@ 2018-12-01 19:49 ` bugzilla-daemon
  2018-12-01 21:13 ` bugzilla-daemon
                   ` (156 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 19:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #113 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Néstor A. Marchesini from comment #112)
> $ zcat /proc/config.gz |grep DAX
> CONFIG_DAX=m
> # CONFIG_FS_DAX is not set
> 
> DAX I have it as a module, but I've never seen it loaded with lsmod.
> 
> $ cat /sys/block/sda/queue/scheduler
> [none]
> 
> Always use the mounting parameters in my partitions  barrier=1,data=ordered
> 
> $ cat /etc/fstab |grep LABEL
> LABEL=GentooBoot  /boot    ext4   noatime,noauto,barrier=1,data=ordered    0 2
> LABEL=GentooSwap  none     swap   swap  0 0
> LABEL=GentooRaiz  /        ext4   noatime,barrier=1,data=ordered   0 1
> LABEL=GentooHome  /home    ext4   noatime,barrier=1,data=ordered   0 2
> 
> Excellent point given by Guenter Roeck in comment 93
> I would have to try to create several trees 4.19.5 and be removed one or two
> at a time,
> to isolate the fault.
> I still use 4.19.5 with the ext4 folder of 4.18.20 and zero problems.
> 
> Regards

Have you given up on your plan to bisect this as suggested in comment 79?

It would be only 5 steps for those 32 commits. And compile times should be
rather short.

If you know how to reproduce/provoke the errors it could be done within 2 hours
or less.
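
The five-step figure follows from bisection halving the candidate range on
every round, so 32 commits need about log2(32) = 5 builds. A minimal sketch of
that arithmetic (bisect_steps is a hypothetical helper name, and git bisect's
real step count can differ slightly depending on the shape of the history):

```shell
# Hypothetical helper: count how many halvings reduce n candidates to one.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))   # round up: the larger half may be the one left
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 32   # prints 5
```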


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (113 preceding siblings ...)
  2018-12-01 19:49 ` bugzilla-daemon
@ 2018-12-01 21:13 ` bugzilla-daemon
  2018-12-01 23:44 ` bugzilla-daemon
                   ` (155 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 21:13 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #114 from Theodore Tso (tytso@mit.edu) ---
So I've gotten a query off-line about whether I'm still paying attention to
this bug.   The answer is that I'm absolutely paying attention.   The reason
why I haven't commented much is because there's not much else to say, and I'm
still waiting for more information.

On that front --- I am *absolutely* grateful for people who are still trying to
debug this issue, especially when it may be coming at the risk of their data.  
However, one of the challenges is that it's very easy for reports to be either
false positives or false negatives.

False positives come from booting a kernel which might be fine, but the file
system was corrupted from running a previous kernel.   Remember, when you get
an EXT4-fs error report, that's when the kernel discovers the file system
corruption; it doesn't necessarily mean that the currently running kernel is
buggy.   To prevent these false positives, please run "e2fsck -fy /dev/sdXX >
/tmp/log.1" to make sure the file system is clean before rebooting into the new
kernel.  If e2fsck -fy shows any problems that are fixed, please then run "echo
3 > /proc/sys/vm/drop_caches ; e2fsck -fn /dev/sdXX > /tmp/log.2" to make sure
the file system is really clean.
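
The two-pass check above could be wrapped roughly as follows. This is only a
sketch: check_clean is a hypothetical helper name, and the DROP_CACHES and
LOG_DIR overrides are assumptions added so the paths can be changed. It relies
on e2fsck exiting with status 0 only when it found nothing to fix, and it is
meant to be run as root on an unmounted file system.

```shell
# Sketch only. check_clean is a hypothetical helper; DROP_CACHES and LOG_DIR
# are assumptions, not part of e2fsprogs.
check_clean() {
    dev=$1
    drop_caches=${DROP_CACHES:-/proc/sys/vm/drop_caches}
    log_dir=${LOG_DIR:-/tmp}

    # Pass 1: forced check that answers "yes" to every proposed fix.
    rc=0
    e2fsck -fy "$dev" > "$log_dir/log.1" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "pass 1 found nothing to fix"
        return 0
    fi

    # Pass 1 fixed something (or failed): drop cached data so that pass 2
    # rereads from the disk, then check again without modifying anything.
    echo 3 > "$drop_caches"
    rc=0
    e2fsck -fn "$dev" > "$log_dir/log.2" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "pass 2 reports a clean file system"
    else
        echo "pass 2 still reports problems, see $log_dir/log.2"
    fi
    return "$rc"
}
```

Usage would be "check_clean /dev/sdXX" before rebooting into the kernel under
test, with /dev/sdXX standing for the affected partition as in the text above.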

False negatives come from booting a kernel which is buggy, but since this bug
seems to be a bit flaky, you're getting lucky/unlucky enough that
after N hours/days, you just haven't tripped over the bug --- or you *have*
tripped over the bug, but the kernel hasn't noticed the problem yet, and so it
hasn't reported the EXT4-fs error yet.     There's not a lot we can do to
absolutely avoid all false negatives, but if you are running a kernel which you
report is OK, and then a day later, it turns out you see corruption, please
don't forget to submit a comment to bugzilla, saying, "my comment in #NNN,
where I said a particular kernel was problem-free; turns out I have seen a
problem with it."


Again, my thanks for trying to figure out what's going on.  Rest assured that
Jens Axboe and I are both paying very close attention.   This bug is a really
scary one, both because of the severity of its consequences, *and* because
neither of us can reproduce it on our own systems or regression tests --- so we
are utterly reliant on those people who *can* reproduce the issue to give us
data.   We very much want to make sure this gets fixed ASAP!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (114 preceding siblings ...)
  2018-12-01 21:13 ` bugzilla-daemon
@ 2018-12-01 23:44 ` bugzilla-daemon
  2018-12-02  0:01 ` bugzilla-daemon
                   ` (154 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-01 23:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #115 from Guenter Roeck (linux@roeck-us.net) ---
Still playing. I have now seen the problem several times with the patches per
#99 reverted; I am just not 100% sure whether I am seeing false positives.

For those claiming that upstream developers don't care: I for my part do plan
to spend as much time on this as needed to nail down the problem, though I have
to admit that the comment in #87 almost made me quit (and wonder why I spend
time, energy, and money running kerneltests.org).


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (115 preceding siblings ...)
  2018-12-01 23:44 ` bugzilla-daemon
@ 2018-12-02  0:01 ` bugzilla-daemon
  2018-12-02  0:23 ` bugzilla-daemon
                   ` (153 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:01 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #116 from Michael Orlitzky (michael@orlitzky.com) ---
Here are two other data points, just for the record:

  1. Like comment #65, I've only actually seen this corruption on three
     physical disks, and all of them were Western Digital Caviar Blacks.
     There is another disk in my system -- a different model -- that has
     been lucky so far; but this may be pure chance. My /dev/sda has had
     multiple problems, but /dev/sdb only got corrupted once even though
     they're the same model.

  2. The corruption for me is occurring in files (contained in directories)
     that I haven't touched in a long time. They get backed up -- which
     means that they get read -- but few if any have been written recently.
     In addition, all of my mounts are "noatime." Normally I wouldn't expect
     corruption from *reading* files, which is what led me to start
     swapping out disks and SATA controllers.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (116 preceding siblings ...)
  2018-12-02  0:01 ` bugzilla-daemon
@ 2018-12-02  0:23 ` bugzilla-daemon
  2018-12-02  0:37 ` bugzilla-daemon
                   ` (152 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:23 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #117 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Artem S. Tashkinov from comment #87)
> Regression testing could be carried out in a VM running on top of a ramdisk
> (e.g. tmpfs) to speed up the process.
> 
> I guess someone with a decent amount of persistence and spare time could do
> that and test each individual commit between 4.18 and 4.19, however that
> doesn't guarantee success since the bug might be hardware related and not
> reproducible in a virtual environment. Or it might require obscene amounts
> of RAM/disk space which would be difficult, if not impossible to reproduce
> in a VM.
> 
> I for one decided to stay on 4.18.x and not upgrade to any more recent
> kernels until the regression is identified and dealt with.
> 
> Maybe one day someone will become truly invested in the kernel development
> process and we'll have proper QA/QC/unit testing/regression
> testing/fuzzying, so that individuals won't have to sacrifice their data and
> time because kernel developers are mostly busy with adding new features and
> usually not really concerned with performance, security and stability of
> their code unless they are pointed at such issues.

You obviously have no idea wtf you are talking about, I suggest you go
investigate just how much testing is done, continuously, on things like file
system and storage. I take personal offense at claims that developers "are
not really concerned with performance and stability of their code".

Here's a news flash for you - bugs happen, no matter how much testing is done
on something.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (117 preceding siblings ...)
  2018-12-02  0:23 ` bugzilla-daemon
@ 2018-12-02  0:37 ` bugzilla-daemon
  2018-12-02  0:44 ` bugzilla-daemon
                   ` (151 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

James Courtier-Dutton (James@superbug.co.uk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |James@superbug.co.uk

--- Comment #118 from James Courtier-Dutton (James@superbug.co.uk) ---
I have not observed the problem, but I have been thinking of maybe a more
reliable way to detect a problem.
btrfs has a "scrub" command that essentially verifies the checksum of every
file on the disk.
Now, ext4 does not have such a feature (as far as I know).
How about people who are seeing this problem, do a recursive sha1sum -b of
every file on the disk while in a known good state, and then do a sha1sum -c 
of every file on the disk to see which ones got corrupted.
This might help when doing git bisect and checking that we are back to a known
good file system, and in cases like comment #116, item 2.
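
The record-then-verify idea could be sketched like this. The helper names
make_manifest and verify_manifest are hypothetical, the options are those of
GNU find, sort, xargs and sha1sum, and the manifest is assumed to be given as
an absolute path outside the tree being hashed:

```shell
# Hypothetical helpers for this sketch; they are not existing tools.
make_manifest() {
    root=$1
    manifest=$2   # absolute path, stored outside $root
    # -xdev stays on one file system; -print0/-z/-0 cope with odd file names.
    ( cd "$root" && find . -xdev -type f -print0 | sort -z |
        xargs -0 -r sha1sum -b ) > "$manifest"
}

verify_manifest() {
    root=$1
    manifest=$2
    # --quiet prints only the files whose checksum no longer matches;
    # the exit status is non-zero if any file fails.
    ( cd "$root" && sha1sum -c --quiet "$manifest" )
}
```

One would run make_manifest while the file system is known to be good and
verify_manifest after running the suspect kernel. Dropping the page cache
before verifying should make it more likely that the data is reread from disk
rather than served from memory.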

Also, I think there is a way to force a reboot to a particular kernel, using
grub, so one could script and git bisect, reboot to old working kernel, fsck,
then reboot to problem kernel and start next git bisect all using automated
scripts.

Anyway, just ideas.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (118 preceding siblings ...)
  2018-12-02  0:37 ` bugzilla-daemon
@ 2018-12-02  0:44 ` bugzilla-daemon
  2018-12-02  0:48 ` bugzilla-daemon
                   ` (150 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #119 from Guenter Roeck (linux@roeck-us.net) ---
I think I need some education. It has been suggested several times -
both here and elsewhere on the web - that the problem might possibly be caused
by bad drives. Yet, I don't recall a single instance where a disk error was
reported in conjunction with this problem. I most definitely don't see one on
my systems.

Can hard drives and SSDs nowadays fail silently by reading bad data instead of
reporting read (and/or write) errors? I would find that thought quite scary.
Can someone point me to related literature?

In this context, it seems odd that this presumed silent disk error would only
show up with v4.19.x, but not with earlier kernels.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (119 preceding siblings ...)
  2018-12-02  0:44 ` bugzilla-daemon
@ 2018-12-02  0:48 ` bugzilla-daemon
  2018-12-02  0:50 ` bugzilla-daemon
                   ` (149 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:48 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #120 from Florian Bruhin (kernel.org@the-compiler.org) ---
> How about people who are seeing this problem, do a recursive sha1sum -b of
> every file on the disk while in a known good state, and then do a sha1sum -c 
> of every file on the disk to see which ones got corrupted.

FWIW, https://github.com/claudehohl/checksummer does that (and saves the
checksums in a sqlite database).


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (120 preceding siblings ...)
  2018-12-02  0:48 ` bugzilla-daemon
@ 2018-12-02  0:50 ` bugzilla-daemon
  2018-12-02  0:52 ` bugzilla-daemon
                   ` (148 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #121 from Jimmy.Jazz@gmx.net ---
I have the problem again with kernel 4.19.5 (J. Axboe patches). Sorry I don't
trust 4.19.x without Axboe patch because if some fs corruptions reappear they
will be less violent than without the patch.

OS uptime  8:51, 37 static ext4 mountpoints as they are reported by 'mount'
command.

I have checked with the same kernel 4.19.5 in rescue mode (i.e. none of the file
systems are mounted)

summary:
- e2fsck -fy /dev/dm-4 > log.1
- echo 3 > /proc/sys/vm/drop_caches
- e2fsck -fn /dev/dm-4 > log.2
- reboot again in normal mode ( / is mounted)
- fsck -TRAC -r -M -p  /dev/dm-X || fsck -f -TRAC -y -s /dev/dm-X
- if there is a new stable release then I compile and test the new release

Result:
The system doesn't see any error (fsck) on reboot (rescue and normal boot).

see attachments.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (121 preceding siblings ...)
  2018-12-02  0:50 ` bugzilla-daemon
@ 2018-12-02  0:52 ` bugzilla-daemon
  2018-12-02  0:52 ` bugzilla-daemon
                   ` (147 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:52 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #122 from Jimmy.Jazz@gmx.net ---
Created attachment 279771
  --> https://bugzilla.kernel.org/attachment.cgi?id=279771&action=edit
dmesg shows errors  before reboot

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (122 preceding siblings ...)
  2018-12-02  0:52 ` bugzilla-daemon
@ 2018-12-02  0:52 ` bugzilla-daemon
  2018-12-02  0:56 ` bugzilla-daemon
                   ` (146 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:52 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #123 from Jimmy.Jazz@gmx.net ---
Created attachment 279773
  --> https://bugzilla.kernel.org/attachment.cgi?id=279773&action=edit
logs show no error after reboot

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (123 preceding siblings ...)
  2018-12-02  0:52 ` bugzilla-daemon
@ 2018-12-02  0:56 ` bugzilla-daemon
  2018-12-02  1:25 ` bugzilla-daemon
                   ` (145 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  0:56 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #124 from Jimmy.Jazz@gmx.net ---
Correction to comment #121: please read "more violent without the patch".

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (124 preceding siblings ...)
  2018-12-02  0:56 ` bugzilla-daemon
@ 2018-12-02  1:25 ` bugzilla-daemon
  2018-12-02  3:36 ` bugzilla-daemon
                   ` (144 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  1:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #125 from Bart Van Assche (bvanassche@acm.org) ---
(In reply to James Courtier-Dutton from comment #118)
> How about people who are seeing this problem, do a recursive sha1sum -b of
> every file on the disk while in a known good state, and then do a sha1sum -c
> of every file on the disk to see which ones got corrupted.
> This might help when doing git bisect and checking that we are back to a
> known good file system, and in cases like comment #116, item 2.

That could take a lot of CPU time. On my system git status told me that about
ten source files had disappeared from the kernel tree that I definitely had not
deleted myself. In other words, git can be used to detect filesystem
corruption.
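
For anyone who still wants to try the checksum approach from comment #118,
here is a minimal Python sketch. It is not from the thread; the function names
(sha1_of_file, build_manifest, compare_manifests) are made up for illustration,
and it assumes the tree fits the "known good state, then re-check" workflow
described above.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its SHA-1."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only hash regular files.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha1_of_file(full)
    return manifest


def compare_manifests(known_good, current):
    """Return (missing, changed, added) relative paths, each sorted."""
    missing = sorted(set(known_good) - set(current))
    added = sorted(set(current) - set(known_good))
    changed = sorted(
        path
        for path in set(known_good) & set(current)
        if known_good[path] != current[path]
    )
    return missing, changed, added
```

Build one manifest while the file system is known to be good and a second one
later, then compare them. Files that were changed on purpose in between will
show up as "changed" too, so this works best on directories that are not being
actively modified.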

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (125 preceding siblings ...)
  2018-12-02  1:25 ` bugzilla-daemon
@ 2018-12-02  3:36 ` bugzilla-daemon
  2018-12-02  4:07 ` bugzilla-daemon
                   ` (143 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  3:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Eric Benoit (eric@ecks.ca) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |eric@ecks.ca

--- Comment #126 from Eric Benoit (eric@ecks.ca) ---
FWIW I believe this issue is affecting ZFS as well. I'm getting the occasional
checksum error on a random drive of a RAID-Z configuration (five 4T WD Reds). 

I'd initially suspected a chipset (Intel 5400) issue as it's spread more or
less evenly across the devices. However it's definitely a software issue, as it
only occurs running kernel 4.19.[26] and disappears entirely with 4.18.20.

On a different workstation with ZFS and Seagate drives I haven't been able to
reproduce the issue.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (126 preceding siblings ...)
  2018-12-02  3:36 ` bugzilla-daemon
@ 2018-12-02  4:07 ` bugzilla-daemon
  2018-12-02  4:20 ` bugzilla-daemon
                   ` (142 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  4:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #127 from Theodore Tso (tytso@mit.edu) ---
In reply to #118, from James Courtier-Dutton:

While there have been a few people who have reported problems with the contents
of their files, the vast majority of people are reporting problems that seem to
include complete garbage being written into metadata blocks --- i.e.,
complete garbage in the inode table, block group descriptors, and superblocks.
This is getting detected by the kernel noticing corruption, or by e2fsck
running and noticing that the file system metadata is inconsistent.  More
modern ext4 file systems have metadata checksums turned on, but the reports from
e2fsck seem to indicate that complete garbage (or, more likely, data meant for
block XXX) is getting written to block YYY; as such, the corruption is not
subtle, so generally the kernel doesn't need checksums to figure out that the
metadata blocks are nonsensical.

It should be noted that ext4 has very strong checks to prevent this from
happening.  In particular, when an inode's logical block number is converted to
a physical block number, there is block_validity checking to make sure that the
physical block number for a data block does not map onto a metadata block. 
This prevents a corrupted extent tree from causing ext4 to try to write data
meant for a data block on top of an inode table block, which would cause the
sort of symptoms that some users have reported.
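
The idea behind the check described above can be pictured with a toy model in
Python. This is only an illustration of the principle, not the kernel's code:
the function name, the (start, count) zone representation and the parameters
are all assumptions made for the sketch.

```python
def data_block_range_valid(start, count, total_blocks, metadata_zones):
    """Return True if blocks [start, start + count) may hold file data.

    The range must lie inside the file system and must not overlap any
    metadata zone. metadata_zones is an iterable of (zone_start, zone_count)
    pairs; this layout is hypothetical and chosen for readability.
    """
    if count <= 0 or start < 0 or start + count > total_blocks:
        return False
    end = start + count
    for zone_start, zone_count in metadata_zones:
        zone_end = zone_start + zone_count
        # Two half-open ranges overlap when each one starts before the
        # other one ends.
        if start < zone_end and zone_start < end:
            return False
    return True
```

With zones [(0, 10), (100, 20)] on a 1000-block file system, the range
starting at block 10 with 90 blocks is accepted, while a range starting at
block 95 with 10 blocks is rejected because it runs into the second zone. A
write that passes such a check and still lands on a metadata block would point
at something going wrong after the mapping was done.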

One possible cause is that something below ext4 (e.g. the block layer, or an
I/O scheduler) is scrambling the block number so that a file write meant for
data block XXX is getting written to metadata block YYY.   If Eric Benoit's
report in comment #126 is to be believed, and he is seeing the same behavior with
ZFS, then that might be consistent with a bug in the block layer.

However, some people have reported that transplanting ext4 from 4.18 onto
4.19 has caused the problem to go away.  That would be an argument in favor of
the problem being in ext4. 

Of course, both observations might be flawed (see my previous comments about
false positive and negative reports).  And there might be more than one bug
that we are chasing at the moment. 

But the big question which we don't understand is why are some people seeing
it, but not others.   There are a huge number of variables, from kernel
configs, to what I/O scheduler might be selected, etc.    The bug also seems to
be very flaky, and there is some hint that heavy I/O load is required to
trigger the bug.  So it might be that people who think their kernel is fine
might actually be running a buggy one, because they simply haven't pushed their
system hard enough.  Or it might require heavy workloads of a specific type
(e.g., Direct I/O or Async I/O), or one kind of workload racing with another
type of workload.    This is what makes tracking down this kind of bug really
hard.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (127 preceding siblings ...)
  2018-12-02  4:07 ` bugzilla-daemon
@ 2018-12-02  4:20 ` bugzilla-daemon
  2018-12-02 10:20 ` bugzilla-daemon
                   ` (141 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02  4:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #128 from Theodore Tso (tytso@mit.edu) ---
To Guenter, re: #119.

This is just my intuition, but this doesn't "smell" like a problem with a bad
drive.  There are too many reports where people have said that they don't see
the problem with 4.18, but they do see it with 4.19.0 and newer kernels.   The
reports have been with different hardware, from HDD's to SSD's, with some
people reporting NVMe-attached SSDs and some reporting SATA-attached SSDs.

Can hard drives and SSDs nowadays fail silently by reading bad data instead of
reporting read (and/or write) errors?   One of the things I've learned in my
decades of storage experience is to never rule anything out --- hardware will
do some very strange things.  That being said.... no, normally this would be
highly unlikely.   Hard drives and SSDs have strong error-correcting codes,
parity and super-parity checks in their internal data paths, so silent read
errors are unlikely, unless the firmware is *seriously* screwed up.

In addition, given that some of the reports involve complete garbage getting
written into the inode table, it seems more likely the problem is on the
writing side rather than on the read side.

One thing I would recommend is "tune2fs -e remount-ro /dev/sdXX".  This will
set the default error behavior so that the file system is remounted read-only
when the kernel detects a file system error.  So if the problem
is on the read side, it makes it even more unlikely that the data will be
written back to disk.   Some people may prefer "tune2fs -e panic /dev/sdXX",
especially on production servers.   That way, when the kernel discovers a file
system inconsistency, it will immediately force a reboot, and then the fsck run
on reboot can fix up the problem.  More importantly, by preventing the system
from continuing to operate after a problem has been noticed, it avoids the
problem metastasizing, making things even worse.
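
To confirm which error behavior a file system currently has, one option is to
look at the superblock listing printed by "tune2fs -l". The Python sketch below
is not from the thread: the helper names are made up, and it assumes the
listing contains a line labelled "Errors behavior:", as it typically does.

```python
import subprocess


def errors_behavior(tune2fs_listing):
    """Extract the 'Errors behavior' value from `tune2fs -l` output text.

    Returns None if no such line is present. The field label is an
    assumption based on typical e2fsprogs output.
    """
    for line in tune2fs_listing.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "errors behavior":
            return value.strip()
    return None


def read_errors_behavior(device):
    """Run `tune2fs -l DEVICE` (usually needs root) and parse the result."""
    listing = subprocess.run(
        ["tune2fs", "-l", device],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return errors_behavior(listing)
```

Only the parsing step can be exercised without a real device;
read_errors_behavior() has to be pointed at an actual ext4 block device.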

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (128 preceding siblings ...)
  2018-12-02  4:20 ` bugzilla-daemon
@ 2018-12-02 10:20 ` bugzilla-daemon
  2018-12-02 10:24 ` bugzilla-daemon
                   ` (140 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 10:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #129 from carlphilippreh@gmail.com ---
(In reply to carlphilippreh from comment #29)
> Sorry for the late response, but I have been trying to reproduce the problem
> with 4.19.2 for some while now. It seems that the problem I was experiencing
> only happens with 4.19.1 and 4.19.0, and it did so very frequently. I can at
> least confirm that I have CONFIG_SCSI_MQ_DEFAULT=y set in 4.19 but I didn't
> in 4.18. I hope that this is, at least for me, fixed for now.

While I wasn't able to reproduce the bug for quite some time, it ended up
coming back. I'm currently running 4.19.6 and I see invalid metadata in files
that I have written using this version.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (129 preceding siblings ...)
  2018-12-02 10:20 ` bugzilla-daemon
@ 2018-12-02 10:24 ` bugzilla-daemon
  2018-12-02 10:25 ` bugzilla-daemon
                   ` (139 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 10:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #130 from carlphilippreh@gmail.com ---
Just as I was writing this, my third computer running Linux (currently at
4.19.6) is now also running into this issue.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (130 preceding siblings ...)
  2018-12-02 10:24 ` bugzilla-daemon
@ 2018-12-02 10:25 ` bugzilla-daemon
  2018-12-02 10:46 ` bugzilla-daemon
                   ` (138 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 10:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #131 from Reindl Harald (harry@rhsoft.net) ---
Has anybody ever seen this bug within a virtual machine?

I currently run 4.19.x only inside VMs on VMware Workstation / VMware ESXi and
have not seen any issues. My only physical test was my home server, which
completely crashed 4 times, likely because of the VMware Workstation 14
kernel modules, and lasted only for a weekend (RAID10, 2 Samsung Evo 850 2 TB,
2 Samsung Evo 860 2 TB). The last crash left a 0-byte "firewall.sh" in the
nested VM I had been working in for hours.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (131 preceding siblings ...)
  2018-12-02 10:25 ` bugzilla-daemon
@ 2018-12-02 10:46 ` bugzilla-daemon
  2018-12-02 11:36 ` bugzilla-daemon
                   ` (137 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 10:46 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #132 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to carlphilippreh from comment #129)
> While I wasn't able to reproduce the bug for quite some time, it ended up
> coming back. I'm currently running 4.19.6 and I see invalid metadata in
> files that I have written using this version.

I think it could be helpful if you provided your .config(s) here.

And what kernel are you using: self-compiled/from your distribution (which)?
If self-compiled: have you made changes to the .config?

The set-up of the boxes you mentioned in comment 5 seems just right to hunt
this down. ;)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (132 preceding siblings ...)
  2018-12-02 10:46 ` bugzilla-daemon
@ 2018-12-02 11:36 ` bugzilla-daemon
  2018-12-02 11:36 ` bugzilla-daemon
                   ` (136 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 11:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #133 from carlphilippreh@gmail.com ---
Created attachment 279779
  --> https://bugzilla.kernel.org/attachment.cgi?id=279779&action=edit
Config of first computer

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (133 preceding siblings ...)
  2018-12-02 11:36 ` bugzilla-daemon
@ 2018-12-02 11:36 ` bugzilla-daemon
  2018-12-02 11:41 ` bugzilla-daemon
                   ` (135 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 11:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #134 from carlphilippreh@gmail.com ---
Created attachment 279781
  --> https://bugzilla.kernel.org/attachment.cgi?id=279781&action=edit
Config of second computer

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (134 preceding siblings ...)
  2018-12-02 11:36 ` bugzilla-daemon
@ 2018-12-02 11:41 ` bugzilla-daemon
  2018-12-02 11:57 ` bugzilla-daemon
                   ` (134 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 11:41 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #135 from jaapbuurman@gmail.com ---
Doesn't the Linux kernel team have any procedures in place for when such a
critical bug is found? There are many people running this "stable" 4.19 branch,
many of whom are unaware of this bug. Shouldn't the stable branch be rolled
back to the last known good version? Going back to 4.18 is certainly a better
option, but people unaware of this bug might still be running 4.19.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (135 preceding siblings ...)
  2018-12-02 11:41 ` bugzilla-daemon
@ 2018-12-02 11:57 ` bugzilla-daemon
  2018-12-02 11:59 ` bugzilla-daemon
                   ` (133 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 11:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #136 from carlphilippreh@gmail.com ---
(In reply to Rainer Fiebig from comment #132)
> (In reply to carlphilippreh from comment #129)
> > While I wasn't able to reproduce the bug for quite some time, it ended up
> > coming back. I'm currently running 4.19.6 and I see invalid metadata in
> > files that I have written using this version.
> 
> I think it could be helpful if you provided your .config(s) here.
> 
> And what kernel are you using: self-compiled/from your distribution (which)?
> If self-compiled: have you made changes to the .config?
> 
> The set-up of the boxes you mentioned in comment 5 seems just right to hunt
> this down. ;)

I'm configuring the kernels myself. Two things that I always enable and that
_might_ be related are:

CONFIG_BLK_WBT / CONFIG_BLK_WBT_MQ
and
CONFIG_CFQ_GROUP_IOSCHED

Maybe I can come up with a way to reproduce this bug more quickly. Writing a
lot of (small) files and then deleting them seems like a good way so far.
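
A rough Python sketch of such a "write many small files, check them, delete
them" loop is below. It is only an assumption about what a reproducer could
look like, not something that has been shown to trigger the bug; the function
names, file naming scheme and default sizes are made up.

```python
import hashlib
import os


def payload_for(index, size):
    """Deterministic pseudo-random bytes for file number `index`."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


def churn_small_files(directory, count=1000, size=4096):
    """Write, verify and delete `count` small files.

    Returns the list of file indices whose content did not read back
    as written.
    """
    os.makedirs(directory, exist_ok=True)
    paths = []
    for index in range(count):
        path = os.path.join(directory, f"churn-{index:06d}.bin")
        with open(path, "wb") as handle:
            handle.write(payload_for(index, size))
            handle.flush()
            os.fsync(handle.fileno())  # push the data towards the device
        paths.append(path)

    bad = []
    for index, path in enumerate(paths):
        with open(path, "rb") as handle:
            if handle.read() != payload_for(index, size):
                bad.append(index)
        os.remove(path)
    return bad
```

Note that the read-back will normally be served from the page cache, so a
mismatch on disk may go unnoticed unless the caches are dropped or the file
system is remounted between the write and verify phases. Running e2fsck
afterwards is still needed to catch metadata damage.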

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (136 preceding siblings ...)
  2018-12-02 11:57 ` bugzilla-daemon
@ 2018-12-02 11:59 ` bugzilla-daemon
  2018-12-02 12:01 ` bugzilla-daemon
                   ` (132 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 11:59 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #137 from Marc Burkhardt (marc@osknowledge.org) ---
(In reply to jaapbuurman from comment #135)
> Doesn't the Linux kernel team have any procedures in place for when such a
> critical bug is found? There are many people running this "stable" 4.19
> branch, many of whom are unaware of this bug. Shouldn't the stable branch be
> rolled back to the last known good version? Going back to 4.18 is certainly
> a better option, but people unaware of this bug might still be running 4.19.

That would mean depublishing of the 4.19 release as a whole as nobody knows
_what_ exactly to roll back. And if one would know, they would fix the bug
instead.

I cannot remember such a scenario/bug in the past...

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (137 preceding siblings ...)
  2018-12-02 11:59 ` bugzilla-daemon
@ 2018-12-02 12:01 ` bugzilla-daemon
  2018-12-02 12:07 ` bugzilla-daemon
                   ` (131 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 12:01 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #138 from Reindl Harald (harry@rhsoft.net) ---
At least continuing security updates for 4.18.x would probably be a good idea.

Fedora 28 is already on 4.19.x.

I now run 4.18.20-100.fc27.x86_64, which was the last Fedora 27 update, on
every F28 server, and until this problem is solved I refuse to run 4.19.x in
production, which essentially means no security fixes for an unknown amount of
time.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (138 preceding siblings ...)
  2018-12-02 12:01 ` bugzilla-daemon
@ 2018-12-02 12:07 ` bugzilla-daemon
  2018-12-02 12:27 ` bugzilla-daemon
                   ` (130 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 12:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #139 from Marc Burkhardt (marc@osknowledge.org) ---
(In reply to carlphilippreh from comment #136)
> (In reply to Rainer Fiebig from comment #132)
> > (In reply to carlphilippreh from comment #129)
> > > While I wasn't able to reproduce the bug for quite some time, it ended up
> > > coming back. I'm currently running 4.19.6 and I see invalid metadata in
> > > files that I have written using this version.
> > 
> > I think it could be helpful if you provided your .config(s) here.
> > 
> > And what kernel are you using: self-compiled/from your distribution
> (which)?
> > If self-compiled: have you made changes to the .config?
> > 
> > The set-up of the boxes you mentioned in comment 5 seems just right to hunt
> > this down. ;)
> 
> I'm configuring the kernels myself. Two things that I always enable and that
> _might_ be related are:
> 
> CONFIG_BLK_WBT / CONFIG_BLK_WBT_MQ
> and
> CONFIG_CFQ_GROUP_IOSCHED
> 
> Maybe I can come up with a way to reproduce this bug more quickly. Writing a
> lot of (small) files and then deleting them seems like a good way so far.

I have these config options set and _currently_ no corruption.

Having this compiled in is *probably* not what to look for. Rather, people
should look for actual *usage* of these features. I use the deadline scheduler.

root@marc:~ # egrep  "(BLK_WBT|IOSCH)" /boot/config-4.19.5loc64 
CONFIG_BLK_WBT=y
CONFIG_BLK_WBT_SQ=y
CONFIG_BLK_WBT_MQ=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
CONFIG_DEFAULT_IOSCHED="deadline"
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
CONFIG_IOSCHED_BFQ=y
# CONFIG_BFQ_GROUP_IOSCHED is not set


Could someone gather a list of which .config options are actually relevant and
which are irrelevant? I don't want to do it myself, as I'm not sure I wouldn't
mess it up. For example, "DAX enabled in the .config" was discussed, but I have
it compiled in and am not actually using it.
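
One way to start such a list would be to diff the .config files of affected
and unaffected systems mechanically. The following Python sketch is not from
the thread and its function names are made up. It assumes the usual .config
syntax ("CONFIG_X=value" and "# CONFIG_X is not set") and, as a
simplification, treats an option that is absent from a file as "n".

```python
def parse_kernel_config(text):
    """Parse .config text into {option: value}.

    Lines of the form '# CONFIG_X is not set' are recorded as 'n'.
    Other comments and blank lines are ignored.
    """
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def config_differences(left, right):
    """Return {option: (left_value, right_value)} for options that differ.

    An option missing from one side is treated as 'n' (an assumption).
    """
    diff = {}
    for name in sorted(set(left) | set(right)):
        a, b = left.get(name, "n"), right.get(name, "n")
        if a != b:
            diff[name] = (a, b)
    return diff
```

This only shows what is compiled in, so it cannot tell whether a feature is
actually in use at run time.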

I would, moreover, like to gather the actual settings used by people who run
into the bug and by those who do not, such as the currently used schedulers,
nr_requests, discard, ...
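
As a starting point for collecting those run-time settings, here is a small
Python sketch that reads a few per-device queue attributes from sysfs. It is
not from the thread: the function names are made up, and the chosen attribute
list (scheduler, nr_requests, discard_max_bytes) is just an assumption about
what might be worth comparing.

```python
import os


def active_scheduler(scheduler_line):
    """Return the bracketed entry from a line such as 'noop [deadline] cfq'."""
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_queue_settings(
    sys_block="/sys/block",
    names=("scheduler", "nr_requests", "discard_max_bytes"),
):
    """Collect selected queue attributes for every device under sys_block."""
    report = {}
    for device in sorted(os.listdir(sys_block)):
        queue_dir = os.path.join(sys_block, device, "queue")
        settings = {}
        for name in names:
            try:
                with open(os.path.join(queue_dir, name)) as handle:
                    settings[name] = handle.read().strip()
            except OSError:
                continue  # attribute missing or unreadable on this device
        if "scheduler" in settings:
            settings["active_scheduler"] = active_scheduler(
                settings["scheduler"]
            )
        report[device] = settings
    return report
```

Posting the resulting dictionary together with the kernel version would make
reports from affected and unaffected systems easier to compare.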

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (139 preceding siblings ...)
  2018-12-02 12:07 ` bugzilla-daemon
@ 2018-12-02 12:27 ` bugzilla-daemon
  2018-12-02 12:38 ` bugzilla-daemon
                   ` (129 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 12:27 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #140 from jaapbuurman@gmail.com ---
(In reply to Marc Burkhardt from comment #137)
> (In reply to jaapbuurman from comment #135)
> > Doesn't the Linux kernel team have any procedures in place for when such a
> > critical bug is found? There are many people running this "stable" 4.19
> > branch, many of whom are unaware of this bug. Shouldn't the stable branch
> be
> > rolled back to the last known good version? Going back to 4.18 is certainly
> > a better option, but people unaware of this bug might still be running
> 4.19.
> 
> That would mean depublishing of the 4.19 release as a whole as nobody knows
> _what_ exactly to roll back. And if one would know, they would fix the bug
> instead.
> 
> I cannot remember such a scenario/bug in the past...

I know it sounds bad, but isn't depublishing 4.19 the best course of action
right now? There's probably a lot of people running 4.19 that are completely
unaware of this bug and might or might not run into this later.

Data corruption issues are among the worst, and should be addressed ASAP, even
if it means temporarily depublishing the latest kernel, right?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (140 preceding siblings ...)
  2018-12-02 12:27 ` bugzilla-daemon
@ 2018-12-02 12:38 ` bugzilla-daemon
  2018-12-02 13:28 ` bugzilla-daemon
                   ` (128 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 12:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #141 from Marc Burkhardt (marc@osknowledge.org) ---
(In reply to jaapbuurman from comment #140)
> (In reply to Marc Burkhardt from comment #137)
> > (In reply to jaapbuurman from comment #135)
> > > Doesn't the Linux kernel team have any procedures in place for when such
> a
> > > critical bug is found? There are many people running this "stable" 4.19
> > > branch, many of whom are unaware of this bug. Shouldn't the stable branch
> > be
> > > rolled back to the last known good version? Going back to 4.18 is
> certainly
> > > a better option, but people unaware of this bug might still be running
> > 4.19.
> > 
> > That would mean depublishing of the 4.19 release as a whole as nobody knows
> > _what_ exactly to roll back. And if one would know, they would fix the bug
> > instead.
> > 
> > I cannot remember such a scenario/bug in the past...
> 
> I know it sounds bad, but isn't depublishing 4.19 the best course of action
> right now? There's probably a lot of people running 4.19 that are completely
> unaware of this bug and might or might not run into this later.
> 
> Data corruption issues are one of the worst, and should be addressed ASAP,
> even if it means temporary depublishing the latest kernel, right?

4.18.20 is from Nov 21st and was released alongside 4.19.3. Because 4.18 is
EOL, it lacks the three releases of fixes that 4.19 has received up to 4.19.6.

4.19 is out in the wild now. You cannot "get it back" ...

And people are probably more aware of a new 4.19 release pushed by the distros
than a rollback of the 4.19 release.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (141 preceding siblings ...)
  2018-12-02 12:38 ` bugzilla-daemon
@ 2018-12-02 13:28 ` bugzilla-daemon
  2018-12-02 13:35 ` bugzilla-daemon
                   ` (127 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 13:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #142 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to carlphilippreh from comment #134)
> Created attachment 279781 [details]
> Config of second computer

Thanks. It'll take a while to sift through this.

As an alternative to 4.19 you may want to use one of the latest LTS-kernels,
4.14.84 perhaps.[1]
But before compiling/installing it, make sure the fs is OK (see comment 114).

[1] https://www.kernel.org/

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (142 preceding siblings ...)
  2018-12-02 13:28 ` bugzilla-daemon
@ 2018-12-02 13:35 ` bugzilla-daemon
  2018-12-02 13:43 ` bugzilla-daemon
                   ` (126 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 13:35 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Daniel Harding (dharding@living180.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dharding@living180.net

--- Comment #143 from Daniel Harding (dharding@living180.net) ---
Another datapoint:  I have observed Ext4 metadata corruption under both 4.19.1
and 4.19.4.  I'm using LVM (but no RAID); the underlying drive is a 1GB
SATA-attached Samsung 850 PRO SSD.  I've not been able to reliably reproduce,
but an rsync-based backup of my home partition runs once an hour and it usually
starts reporting corruption errors within a day or two of booting a 4.19.x
kernel.  So far the corruption has only happened in directories that I am not
actively using - as far as I know they are only being accessed by the rsync
process.  Since I started seeing the corruption under 4.19.x, I've run 4.18.16
for two stretches, one of which was twelve days, without any problems, so I'm
quite confident it is not an issue of defective hardware.  I have a weekly cron
job which runs fstrim, but at least once I booted into 4.19.4 (previous boot
was 4.18.16) and started seeing metadata corruption after about 36 hours, even
though fstrim had not run during that time.

Some (possibly) relevant kernel configs:
CONFIG_SCSI_MQ_DEFAULT=y
# CONFIG_DM_MQ_DEFAULT is not set
# CONFIG_MQ_IOSCHED_DEADLINE is not set
# CONFIG_MQ_IOSCHED_KYBER is not set
CONFIG_DAX_DRIVER=y
CONFIG_DAX=y
# CONFIG_DEV_DAX is not set
# CONFIG_FS_DAX is not set

$ cat /sys/block/sda/queue/scheduler
[none] bfq

I'm happy to report any more info about my kernel/system if it would be
helpful, but unfortunately I don't have the bandwidth to do any bisection right
now.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (143 preceding siblings ...)
  2018-12-02 13:35 ` bugzilla-daemon
@ 2018-12-02 13:43 ` bugzilla-daemon
  2018-12-02 14:06 ` bugzilla-daemon
                   ` (125 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 13:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #144 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Daniel Harding from comment #143)
> Another datapoint:  I have observed Ext4 metadata corruption under both
> 4.19.1 and 4.19.4.  I'm using LVM (but no RAID); the underlying drive is a
> 1GB SATA-attached Samsung 850 PRO SSD.  I've not been able to reliably
> reproduce, but an rsync-based backup of my home partition runs once an hour
> and it usually starts reporting corruption errors within a day or two of
> booting a 4.19.x kernel.  So far the corruption has only happened in
> directories that I am not actively using - as far as I know they are only
> being accessed by the rsync process.  Since I started seeing the corruption
> under 4.19.x, I've run 4.18.16 for two stretches, one of which was twelve
> days, without any problems, so I'm quite confident it is not an issue of
> defective hardware.  I have a weekly cron job which runs fstrim, but at
> least once I booted into 4.19.4 (previous boot was 4.18.16), and started
> seeing metadata corruption after about 36 hours, but fstrim had not run
> during that time.
> 
> Some (possibly) relevant kernel configs:
> CONFIG_SCSI_MQ_DEFAULT=y
> # CONFIG_DM_MQ_DEFAULT is not set
> # CONFIG_MQ_IOSCHED_DEADLINE is not set
> # CONFIG_MQ_IOSCHED_KYBER is not set
> CONFIG_DAX_DRIVER=y
> CONFIG_DAX=y
> # CONFIG_DEV_DAX is not set
> # CONFIG_FS_DAX is not set
> 
> $ cat /sys/block/sda/queue/scheduler
> [none] bfq
> 
> I'm happy to report any more info about my kernel/system if it would be
> helpful, but unfortunately I don't have the bandwidth to do any bisection
> right now.

Bisecting just fs/ext4 (comment 79) wouldn't cost much time. Just 32 commits, 5
steps. It won't get much cheaper than that.
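
The "32 commits, 5 steps" figure follows from bisection halving the candidate
range on every build. A minimal sketch of that arithmetic (the function name
is made up for illustration; git itself only promises "roughly" this many
steps, because merges can change the count):

```python
def bisect_steps(num_commits: int) -> int:
    """Worst-case number of test builds needed to isolate one bad commit
    when every build halves the remaining candidate range."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (num_commits - 1).bit_length()


print(bisect_steps(32))  # 5
```

Restricting the bisection to one directory is done by passing a path after
"--" to "git bisect start", for example "git bisect start <bad> <good> --
fs/ext4".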

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (144 preceding siblings ...)
  2018-12-02 13:43 ` bugzilla-daemon
@ 2018-12-02 14:06 ` bugzilla-daemon
  2018-12-02 14:14 ` bugzilla-daemon
                   ` (124 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 14:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #145 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Reindl Harald from comment #138)
> at least continue security updates for 4.18.x would probably be a good idea
> 
> Fedora 28 is already on 4.19.x
> 
> i run now 4.18.20-100.fc27.x86_64 which was the last Fedora 27 update on
> every F28 server and until this problem is solved i refuse to run 4.19.x in
> production which essentially means no security fixes for a unknown amount of
> time

Perhaps you can use one of the LTS-kernels, like 4.14.84.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (145 preceding siblings ...)
  2018-12-02 14:06 ` bugzilla-daemon
@ 2018-12-02 14:14 ` bugzilla-daemon
  2018-12-02 14:21 ` bugzilla-daemon
                   ` (123 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 14:14 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #146 from Reindl Harald (harry@rhsoft.net) ---
> Perhaps you can use one of the LTS-kernels, like 4.14.84

on Fedora 28? 
seriously?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (146 preceding siblings ...)
  2018-12-02 14:14 ` bugzilla-daemon
@ 2018-12-02 14:21 ` bugzilla-daemon
  2018-12-02 14:38 ` bugzilla-daemon
                   ` (122 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 14:21 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #147 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Reindl Harald from comment #131)
> has anybody ever seen that bug within a virtual machine?
> 
> i currently run 4.19.x only inside VMs on VMware Workstation / VMware ESXi
> and did not see any issues, my only physical test was my homeserver which
> completely crashed 4 times like because of the VMware Workstation 14
> kernel-modules lasted only for a weekend (RAID10, 2 Samsung Evo 850 2 TB, 2
> Samsung Evo 860 2 TB) after the last crash left a 0 byte "firewall.sh" in
> the nested VM i was working for hours

I've installed 4.19.x with a defconfig in a VirtualBox-VM, hoping the issue
would show up and I could bisect it there. I've also varied the config-params
that have been discussed here.

But unfortunately that damn thing runs as nicely in the VM as it does on real
iron - at least here. :)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (147 preceding siblings ...)
  2018-12-02 14:21 ` bugzilla-daemon
@ 2018-12-02 14:38 ` bugzilla-daemon
  2018-12-02 14:42 ` bugzilla-daemon
                   ` (121 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 14:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #148 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Reindl Harald from comment #146)
> > Perhaps you can use one of the LTS-kernels, like 4.14.84
> 
> on Fedora 28? 
> seriously?

Sorry, just trying to help. And I didn't know that one can't run LTS-kernels on
Fedora 28.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (148 preceding siblings ...)
  2018-12-02 14:38 ` bugzilla-daemon
@ 2018-12-02 14:42 ` bugzilla-daemon
  2018-12-02 16:57 ` bugzilla-daemon
                   ` (120 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 14:42 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #149 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Jimmy.Jazz from comment #121)
> I have the problem again with kernel 4.19.5 (J. Axboe patches). Sorry I
> don't trust 4.19.x without Axboe patch because if some fs corruptions
> reappear they will be less violent than without the patch.

You had the issue with the full block patch applied, the one that includes both
the synchronize_rcu() and the quiesce? Or just the partial one I suggested
earlier?

> Result:
> The system doesn't see any error (fsck) on reboot (rescue and normal boot).

Interesting, so it didn't make it to media.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (149 preceding siblings ...)
  2018-12-02 14:42 ` bugzilla-daemon
@ 2018-12-02 16:57 ` bugzilla-daemon
  2018-12-02 17:48 ` bugzilla-daemon
                   ` (119 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 16:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Jukka Santala (donwulff@nic.fi) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |donwulff@nic.fi

--- Comment #150 from Jukka Santala (donwulff@nic.fi) ---
A whole block was corrupted at once, with each inode/file returning "Structure
needs cleaning" and bad extra_isize errors in syslog. This happened three hours
into a plain cp -ax from ext4 to BTRFS mdraid on Ubuntu 18.04.1 with mainline
kernel 4.19.6 in init level 1 (Ubuntu rescue mode with almost nothing else
running).

I saved the corrupted block via debugfs; the filesystem went read-only. After I
dropped the disk caches the block was fine again, with the files accessible. I
am now checking whether the corrupted block contents can be found anywhere else
in the filesystem. The error first happened for me in 4.19.5 after running
cleanly for days; now it occurs constantly.
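
Searching for the saved block elsewhere in the filesystem could be scripted
roughly as below. This is only a sketch under assumptions: a 4096-byte block
size, a misdirected write that landed block-aligned, and a made-up function
name. The demo runs on an in-memory image, not a real device.

```python
import io
from typing import BinaryIO, List


def find_block_copies(image: BinaryIO, needle: bytes,
                      block_size: int = 4096) -> List[int]:
    """Return the numbers of all aligned blocks whose contents equal needle."""
    if len(needle) != block_size:
        raise ValueError("needle must be exactly one block long")
    matches = []
    image.seek(0)
    block_no = 0
    while True:
        chunk = image.read(block_size)
        if len(chunk) < block_size:  # end of image (or trailing partial block)
            break
        if chunk == needle:
            matches.append(block_no)
        block_no += 1
    return matches


# Demo: a four-block image in which blocks 1 and 3 hold the same data.
demo = io.BytesIO(b"\x00" * 4096 + b"\xab" * 4096 + b"\x00" * 4096 + b"\xab" * 4096)
print(find_block_copies(demo, b"\xab" * 4096))  # [1, 3]
```

On a real system the image would be the block device opened read-only in
binary mode, and the needle the block saved via debugfs.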

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (150 preceding siblings ...)
  2018-12-02 16:57 ` bugzilla-daemon
@ 2018-12-02 17:48 ` bugzilla-daemon
  2018-12-02 17:50 ` bugzilla-daemon
                   ` (118 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 17:48 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #151 from Jimmy.Jazz@gmx.net ---
(In reply to Jens Axboe from comment #149)

> You had the issue with the full block patch applied, the one that includes
> both the synchronize_rcu() and the quiesce? Or just the partial one I
> suggested earlier?

synchronize_rcu() and the quiesce as you asked me.


> Interesting, so it didn't make it to media.

The following tests were made on another computer, named orca, so as not to be
confused with the earlier comments I have posted.

Again, I can confirm it, but only with your patches applied. On orca with 4.20
and w/o your patch, the bug was able to entirely wipe out orca's postgres
database :(

It gave me the opportunity to do a full reinstall of orca from the stick.
Don't be confused by the mmp_node_name host name on the newly created
partitions; it has an easy explanation. The bootable stick used to create the
filesystems has a different hostname than the final server (i.e. orca).

Please read the attached bug.orca.tar.xz tar file. You can follow the sequence
of the logs from the file creation times.

I want to stress that the new corruption on dm-10 after the server rebooted has
nothing to do with the one announced earlier in dmesg. Read dmesg-zalman.txt,
dmesg-zalman-2.txt and dumpe2fs-dm-10-after-e2fsk.txt, dmesg-after-e2fsk.txt in
that order. It shows that the dm-10 corruption was initiated during the reboot.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (151 preceding siblings ...)
  2018-12-02 17:48 ` bugzilla-daemon
@ 2018-12-02 17:50 ` bugzilla-daemon
  2018-12-02 18:19 ` bugzilla-daemon
                   ` (117 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 17:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #152 from Jimmy.Jazz@gmx.net ---
Created attachment 279801
  --> https://bugzilla.kernel.org/attachment.cgi?id=279801&action=edit
new generated server

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (152 preceding siblings ...)
  2018-12-02 17:50 ` bugzilla-daemon
@ 2018-12-02 18:19 ` bugzilla-daemon
  2018-12-02 18:56 ` bugzilla-daemon
                   ` (116 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 18:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #153 from James Courtier-Dutton (James@superbug.co.uk) ---
(In reply to Theodore Tso from comment #127)
> 
> While there have been a few people who have reported problems with the
> contents of their files, the vast majority of people are reporting problems
> that seem to include complete garbage being written into metadata blocks ---
> i.e., complete garbage in the inode table, block group descriptor, and
> superblocks.    This is getting detected by the kernel noticing corruption,
> or by e2fsck running and noticing that the file system metadata is
> inconsistent.   More modern ext4 file systems have metadata checksum turned
> on, but the reports from e2fsck seem to indicate that complete garbage (or,
> more likely, data meant for block XXX is getting written to block YYY); as
> such, the corruption is not subtle, so generally the kernel doesn't need
> checksums to figure out that the metadata blocks are nonsensical.
> 

Is it possible to determine the locality of these corruptions?
I.e. Is the corruption to a contiguous page of data (e.g. 4096 bytes corrupted)
or is the corruption scattered, a few bytes here, a few bytes there?
From your comment about "data meant for block XXX is getting written to block
YYY" can I assume this is fact, or is it still TBD?

If it is contiguous data, is there any pattern to the data that would help us
identify where it came from?

Maybe that would help work out where the corruption was coming from.
Maybe it is DMA from some totally unrelated device driver, but by looking at
the data, we might determine which device driver it is?
It might be some vulnerability in the kernel that some hacker is trying to
exploit, but unsuccessfully, resulting in corruption. This could explain the
reason why more people are not seeing the problem.

The reports that the corruption is not persisted to disk in all cases might
imply that it is happening outside the normal code paths, because the normal
code path would have tagged the change as needing to be flushed to disk at
some point.

Looking at the corrupted data would also tell us if values are within expected
ranges, that the normal code path would have validated. If they are outside
those ranges, then it would again imply that the corrupt data is not being
written by the normal ext4 code path, thus further implying that there is not a
bug in the ext4 code, but something else in the kernel is writing to it by
mistake.
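
The locality question (one contiguous damaged region versus a few bytes here
and there) can be answered mechanically once a known-good and a corrupted copy
of the same block are available. A minimal sketch, with a made-up function
name; note that random garbage will match the good data in a few bytes by
chance, so a fully overwritten block may still show up as several long runs:

```python
from typing import List, Tuple


def differing_runs(good: bytes, bad: bytes) -> List[Tuple[int, int]]:
    """Return (offset, length) for every maximal run of differing bytes."""
    if len(good) != len(bad):
        raise ValueError("both copies must have the same length")
    runs = []
    start = None
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            if start is None:
                start = i  # a new run of damage begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # damage extends to the end of the block
        runs.append((start, len(good) - start))
    return runs


good = bytes(16)
scattered = bytearray(good)
scattered[2] = 1
scattered[9] = 1
scattered[10] = 1
print(differing_runs(good, bytes(scattered)))  # [(2, 1), (9, 2)]
print(differing_runs(good, b"\xff" * 16))      # [(0, 16)]
```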

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (153 preceding siblings ...)
  2018-12-02 18:19 ` bugzilla-daemon
@ 2018-12-02 18:56 ` bugzilla-daemon
  2018-12-02 19:07 ` bugzilla-daemon
                   ` (115 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 18:56 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #154 from James Courtier-Dutton (James@superbug.co.uk) ---
I have scanned all the comments. So far I have seen only one person who has
this problem and has also reported what hardware they have.
So the sample size is statistically far too small to conclude that it is an
AMD-only or an Intel-only problem.
Is there anyone out there who sees this problem and is running Intel hardware?
How many people are seeing this problem? Can they each post the output of
"lspci -vvv" and a dmesg showing the problem they have?

This appears to be a problem that is reported by an extremely small number of
people.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (154 preceding siblings ...)
  2018-12-02 18:56 ` bugzilla-daemon
@ 2018-12-02 19:07 ` bugzilla-daemon
  2018-12-02 19:10 ` bugzilla-daemon
                   ` (114 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #155 from Rainer Fiebig (jrf@mailbox.org) ---
#154 re. Intel: start with comments 3/5/6.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (155 preceding siblings ...)
  2018-12-02 19:07 ` bugzilla-daemon
@ 2018-12-02 19:10 ` bugzilla-daemon
  2018-12-02 19:17 ` bugzilla-daemon
                   ` (113 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #156 from Guenter Roeck (linux@roeck-us.net) ---
Status update: I have not been able to reproduce the problem with v4.19.6 minus
the reverts from #99. I did see some failures, specifically exactly one per
affected file system, but I attribute those to false positives (I did not run
fsck as recommended prior to starting the test).

stats:

System 1:

uptime: 18h 45m

iostats:
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
loop0             0.00         0.00         0.00          0          0
nvme0n1         195.54         3.27         3.96     220723     267555
sda             128.88         0.36        18.91      24659    1277907
sdb             131.40        18.85         5.07    1273780     342404

nvme0n1 and sda were previously affected. sdb is a new drive.

System 2:

uptime: 14h 56m

iostats:
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
loop0             0.00         0.00         0.00          5          0
sda              26.45       538.25        87.87   28965283    4728576
sdb             108.87      2875.25      4351.42  154728917  234167724

Both sda and sdb were previously affected.

My next step will be to try v4.19.6 with the following reverts:

Revert "ext4: handle layout changes to pinned DAX mappings"
Revert "dax: remove VM_MIXEDMAP for fsdax and device dax"
Revert "ext4: close race between direct IO and ext4_break_layouts()"

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (156 preceding siblings ...)
  2018-12-02 19:10 ` bugzilla-daemon
@ 2018-12-02 19:17 ` bugzilla-daemon
  2018-12-02 19:18 ` bugzilla-daemon
                   ` (112 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Michael Duell (reg@akurei.me) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |reg@akurei.me

--- Comment #157 from Michael Duell (reg@akurei.me) ---
I have had file system corruption with 4.19 on a BTRFS file system as well.
4.19.2 Kernel. I think this has to be related. No files were actually corrupted
but the Kernel set the file system read-only as soon as the error occurred.

I have even tried a NEW and FRESH btrfs file system created via a LiveCD system
and it happened there as well as soon as I did a btrfs send/receive operation.

I am on a Thinkpad T450.

lspci -vvv 
https://paste.pound-python.org/show/9tZLWlry0Iy7Z629VPea/

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (157 preceding siblings ...)
  2018-12-02 19:17 ` bugzilla-daemon
@ 2018-12-02 19:18 ` bugzilla-daemon
  2018-12-02 19:19 ` bugzilla-daemon
                   ` (111 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:18 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #158 from Michael Duell (reg@akurei.me) ---
Same paste as root: 

https://paste.pound-python.org/show/DZTBYXQBFhcHi69OBba8/

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (158 preceding siblings ...)
  2018-12-02 19:18 ` bugzilla-daemon
@ 2018-12-02 19:19 ` bugzilla-daemon
  2018-12-02 19:20 ` bugzilla-daemon
                   ` (110 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #159 from Rainer Fiebig (jrf@mailbox.org) ---
#156: A ray of hope. Underpins Nestor's findings (comment 78).

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (159 preceding siblings ...)
  2018-12-02 19:19 ` bugzilla-daemon
@ 2018-12-02 19:20 ` bugzilla-daemon
  2018-12-02 19:22 ` bugzilla-daemon
                   ` (109 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #160 from Guenter Roeck (linux@roeck-us.net) ---

Another update: I hit the problem almost immediately with the reverts from
#156.

[ 1826.738686] EXT4-fs error (device sda1): ext4_iget:4796: inode #7633436:
comm borg: bad extra_isize 28534 (inode size 256)
[ 1826.740744] Aborting journal on device sda1-8.
[ 1826.747339] EXT4-fs (sda1): Remounting filesystem read-only
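
For anyone collecting these messages across machines, the "bad extra_isize"
line can be parsed and the reported value checked as sketched below. The two
conditions are an assumption based on my reading of the ext4_iget() check in
kernels of this era (the 128-byte base inode plus the extra fields must fit
in the inode, and the value must be a multiple of 4); the function names are
made up, and the wrapped log line is joined back into one line first.

```python
import re
from typing import List, Optional

GOOD_OLD_INODE_SIZE = 128  # size of the original fixed inode layout

LOG_RE = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): ext4_iget:\d+: "
    r"inode #(?P<inode>\d+): comm (?P<comm>\S+): "
    r"bad extra_isize (?P<extra>\d+) \(inode size (?P<size>\d+)\)"
)


def parse_bad_extra_isize(line: str) -> Optional[dict]:
    """Extract device, inode, command and sizes from one joined log line."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {"dev": m["dev"], "inode": int(m["inode"]), "comm": m["comm"],
            "extra_isize": int(m["extra"]), "inode_size": int(m["size"])}


def why_bad(extra_isize: int, inode_size: int) -> List[str]:
    """List the (assumed) sanity conditions that the value violates."""
    reasons = []
    if GOOD_OLD_INODE_SIZE + extra_isize > inode_size:
        reasons.append("extra fields would run past the end of the inode")
    if extra_isize % 4 != 0:
        reasons.append("not a multiple of 4")
    return reasons


line = ("[ 1826.738686] EXT4-fs error (device sda1): ext4_iget:4796: "
        "inode #7633436: comm borg: bad extra_isize 28534 (inode size 256)")
rec = parse_bad_extra_isize(line)
print(rec["dev"], rec["inode"], why_bad(rec["extra_isize"], rec["inode_size"]))
```

A value such as 28534 fails both conditions, which fits the picture of
unrelated data sitting where the inode should be rather than a subtle
single-field error.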

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (160 preceding siblings ...)
  2018-12-02 19:20 ` bugzilla-daemon
@ 2018-12-02 19:22 ` bugzilla-daemon
  2018-12-02 19:29 ` bugzilla-daemon
                   ` (108 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:22 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #161 from Rainer Fiebig (jrf@mailbox.org) ---
#160: Down to five now.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (161 preceding siblings ...)
  2018-12-02 19:22 ` bugzilla-daemon
@ 2018-12-02 19:29 ` bugzilla-daemon
  2018-12-02 19:34 ` bugzilla-daemon
                   ` (107 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:29 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #162 from Guenter Roeck (linux@roeck-us.net) ---
As next step, I am going to try v4.19.6 with the following reverts:

Revert "ext4: readpages() should submit IO as read-ahead"
Revert "ext4: improve code readability in ext4_iget()"

Those with btrfs problems might consider reverting commit 5e9d398240b2 ("btrfs:
readpages() should submit IO as read-ahead") and report the results.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (162 preceding siblings ...)
  2018-12-02 19:29 ` bugzilla-daemon
@ 2018-12-02 19:34 ` bugzilla-daemon
  2018-12-02 20:34 ` bugzilla-daemon
                   ` (106 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 19:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #163 from Jukka Santala (donwulff@nic.fi) ---
First report specifies AMD, second report is Intel, and so on. I agree more
detailed system information might help find commonalities and false positives,
but the cross-platform nature of the problem seemed established right from the
start.

AMD Phenom(tm) II X4 B50 Processor, SSHD ST1000LM014-1EJ164
[11514.358542] EXT4-fs error (device dm-0): ext4_iget:4831: inode #18288150:
comm cp: bad extra_isize 49917 (inode size 256)
[11514.386613] Aborting journal on device dm-0-8.
[11514.389070] EXT4-fs (dm-0): Remounting filesystem read-only

Errors for each of the inodes on the block follow, until I dropped filesystem
caches (drop_caches 3) and accessed them again and they were fine. Corrupted
block looked random binary, but not compressed. BTRFS was reporting csum errors
every time I dropped caches, which makes me wonder if people having the problem
are using BTRFS?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (163 preceding siblings ...)
  2018-12-02 19:34 ` bugzilla-daemon
@ 2018-12-02 20:34 ` bugzilla-daemon
  2018-12-02 20:45 ` bugzilla-daemon
                   ` (105 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 20:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Andreas Dilger (adilger.kernelbugzilla@dilger.ca) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |adilger.kernelbugzilla@dilger.ca

--- Comment #164 from Andreas Dilger (adilger.kernelbugzilla@dilger.ca) ---
There was a recent post on linux-ext4 that this might relate to a compiler bug:

> After four days playing games around git bisect - real winner is
> debian gcc-8.2.0-9. Upgrade it to 8.2.0-10 or use 7.3.0-30 version for
> same kernel + config - does not exhibit ext4 corruption.

> I think I hit this https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87859
> with 8.2.0-9 version.

Can people hitting this please confirm or deny whether this compiler is in use
on your system.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (164 preceding siblings ...)
  2018-12-02 20:34 ` bugzilla-daemon
@ 2018-12-02 20:45 ` bugzilla-daemon
  2018-12-02 20:47 ` bugzilla-daemon
                   ` (104 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 20:45 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #165 from Guenter Roeck (linux@roeck-us.net) ---
groeck@server:~$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (165 preceding siblings ...)
  2018-12-02 20:45 ` bugzilla-daemon
@ 2018-12-02 20:47 ` bugzilla-daemon
  2018-12-02 20:57 ` bugzilla-daemon
                   ` (103 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 20:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #166 from Jukka Santala (donwulff@nic.fi) ---
However, at least here:
cat /proc/version
Linux version 4.19.6-041906-generic (kernel@gloin) (gcc version 8.2.0 (Ubuntu
8.2.0-9ubuntu1)) #201812010432 SMP Sat Dec 1 09:34:07 UTC 2018
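
The compiler check requested in comment 164 can be scripted against
/proc/version along these lines. This is a sketch only: the regular
expressions and function names are made up, the pattern handles just the
"gcc version X (Distro package)" form shown above, and whether Ubuntu's
8.2.0-9ubuntu1 carries the same defect as the Debian 8.2.0-9 package named in
the report is an open assumption, not something the code can decide.

```python
import re
from typing import Optional

GCC_RE = re.compile(
    r"gcc version (?P<version>[\d.]+) \((?P<distro>\S+) (?P<package>[^)\s]+)\)"
)
# Package revision named in the linux-ext4 report; (?!\d) avoids matching -90 etc.
SUSPECT_RE = re.compile(r"8\.2\.0-9(?!\d)")


def gcc_from_proc_version(text: str) -> Optional[dict]:
    """Return version, distro and package of the gcc that built the kernel."""
    m = GCC_RE.search(text)
    return m.groupdict() if m else None


def is_suspect_package(package: str) -> bool:
    """True if the package revision starts with the reported 8.2.0-9."""
    return SUSPECT_RE.match(package) is not None


sample = ("Linux version 4.19.6-041906-generic (kernel@gloin) "
          "(gcc version 8.2.0 (Ubuntu 8.2.0-9ubuntu1)) "
          "#201812010432 SMP Sat Dec 1 09:34:07 UTC 2018")
info = gcc_from_proc_version(sample)
print(info["package"], is_suspect_package(info["package"]))
```

On a live system the input would be the contents of /proc/version; version
strings in other formats make gcc_from_proc_version() return None.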

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (166 preceding siblings ...)
  2018-12-02 20:47 ` bugzilla-daemon
@ 2018-12-02 20:57 ` bugzilla-daemon
  2018-12-02 21:34 ` bugzilla-daemon
                   ` (102 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 20:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #167 from Bart Van Assche (bvanassche@acm.org) ---
All 4.19.x kernels I tested were built with gcc version 8.2.1 20181025
[gcc-8-branch revision 265488].


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (167 preceding siblings ...)
  2018-12-02 20:57 ` bugzilla-daemon
@ 2018-12-02 21:34 ` bugzilla-daemon
  2018-12-03  0:07 ` bugzilla-daemon
                   ` (101 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-02 21:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Michel Roelofs (michel@michelroelofs.nl) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |michel@michelroelofs.nl

--- Comment #168 from Michel Roelofs (michel@michelroelofs.nl) ---
Here is my experience, which may be related:

[ 2451.982816] EXT4-fs error (device dm-1): ext4_iget:4831: inode #6029313:
comm ls: bad extra_isize 5 (inode size 256)
root@ster:/# debugfs -R 'ncheck 6029313' /dev/dm-1
debugfs 1.43.4 (31-Jan-2017)
Inode   Pathname
6029313 //sabnzb
ncheck: Inode checksum does not match inode while doing inode scan

root@ster:/# echo 2 > /proc/sys/vm/drop_caches
root@ster:/# debugfs -R 'ncheck 6029313' /dev/dm-1
debugfs 1.43.4 (31-Jan-2017)
Inode   Pathname
6029313 //sabnzb
ncheck: Inode checksum does not match inode while doing inode scan

root@ster:/#  echo 3 > /proc/sys/vm/drop_caches
root@ster:/# debugfs -R 'ncheck 6029313' /dev/dm-1
debugfs 1.43.4 (31-Jan-2017)
Inode   Pathname
6029313 //sabnzb

Kernel v4.19.5, CPU Intel Atom D525, Debian Linux 9.6, brand new WDC
WD40EFRX-68N32N0, gcc 6.3.0-18+deb9u1.

Also seen with an ext4 filesystem created on Nov 21 2018.

Also seen with earlier 4.19.0 kernel, and older WDC WD30EFRX-68A in same
computer.

Going back to the latest v4.18.x kernel solved the issues. e2fsck shows no disk
corruption.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (168 preceding siblings ...)
  2018-12-02 21:34 ` bugzilla-daemon
@ 2018-12-03  0:07 ` bugzilla-daemon
  2018-12-03  0:11 ` bugzilla-daemon
                   ` (100 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  0:07 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Sune Mølgaard (molgaard@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |molgaard@gmail.com

--- Comment #169 from Sune Mølgaard (molgaard@gmail.com) ---
FWIW, I didn't see any problems with 4.19.0, but see it on all my systems with
4.19.3 and onward (although I *did* skip 4.19.[12]).

Therefore, I embarked on a git bisect in linux-stable from v4.19 to v4.19.3,
which is nearing its end, *so far with every iteration marked GOOD*.

Referencing https://www.spinics.net/lists/linux-ext4/msg63498.html (#164), and
noting that I usually run kernels from kernel.ubuntu.com/~kernel-ppa/mainline ,
I did the following:

smo@dell-smo:~$ cat /proc/version 
Linux version 4.19.0-041900-generic (kernel@tangerine) (gcc version 8.2.0
(Ubuntu 8.2.0-7ubuntu1)) #201810221809 SMP Mon Oct 22 22:11:45 UTC 2018

Then, I downloaded 4.19.3 from kernel-ppa, unpacked, and:

smo@dell-smo:~/src/deb/foo/boot$ strings vmlinuz-4.19.3-041903-generic |grep
8.2.0
4.19.3-041903-generic (kernel@gloin) (gcc version 8.2.0 (Ubuntu
8.2.0-9ubuntu1)) #201811210435 SMP Wed Nov 21 09:37:20 UTC 2018

BANG, as they say: 8.2.0-9.

While it is not impossible for a git bisect to come up GOOD at every step (as
stated, it is not complete, only almost), it certainly doesn't seem normal, but:

sune@jekaterina:~$ gcc --version
gcc (Ubuntu 8.2.0-7ubuntu1) 8.2.0

...on the system where I self-compile during the bisect, *could* explain it.
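
The same comparison can be scripted. What follows is only a minimal sketch: `gcc_of_banner` is a hypothetical helper name, and it assumes the banner has the distribution form shown above, with a nested "(Ubuntu ...)" or "(Debian ...)" part. It prints nothing for banners of a different shape.

```shell
#!/bin/sh
# Print the "gcc version ..." part of a kernel version banner read on stdin.
# gcc_of_banner is a hypothetical helper name, not an existing tool.
gcc_of_banner() {
    # Join wrapped lines, then keep what sits between "(gcc version " and
    # the ")) #" that precedes the build number.
    tr '\n' ' ' | sed -n 's/.*(gcc version \(.*)\)) #.*/\1/p'
}

# Running kernel:
#   gcc_of_banner < /proc/version
# Unpacked image, as in the strings | grep step above:
#   strings vmlinuz-4.19.3-041903-generic | grep 'gcc version' | gcc_of_banner
```

Comparing the result for a known-good and a known-bad kernel shows at a glance whether the compiler changed between the two builds.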

My impression is, that a lot of affected people are on Ubuntu, and I suspect
the following:

* Many of the affected Ubuntu folks do indeed use kernels from kernel-ppa

* Some of those, as well as non-Ubuntu-folks, may have that compiler version
for other reasons, and hit the bugs on that account

* Bisecting yields inconclusive results, as it seems to do for me, since the
issue is not in the kernel.

* Theodore Ts'o and Jens Axboe are unable to reproduce due to unaffected
compiler versions, which also explains the no-show in regression tests.

Ts'o, Axboe: With the two of you being completely unable to replicate, could you
be enticed to try GCC 8.2.0-9 (or, possibly, just the packages from the
following URLs) and run your regression tests against those?

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19 (presumed GOOD)

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.3 (presumed BAD)

Best regards,

Sune Mølgaard


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (169 preceding siblings ...)
  2018-12-03  0:07 ` bugzilla-daemon
@ 2018-12-03  0:11 ` bugzilla-daemon
  2018-12-03  0:28 ` bugzilla-daemon
                   ` (99 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  0:11 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #170 from Sune Mølgaard (molgaard@gmail.com) ---
Meh, typos:

1. "(or, possibly..." should end the parenthesis after "...the following URLs)

2. Last link should be http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.3


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (170 preceding siblings ...)
  2018-12-03  0:11 ` bugzilla-daemon
@ 2018-12-03  0:28 ` bugzilla-daemon
  2018-12-03  0:59 ` bugzilla-daemon
                   ` (98 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  0:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #171 from Guenter Roeck (linux@roeck-us.net) ---
To reiterate, I use gcc version "5.4.0-6ubuntu1~16.04.10" to build my kernels.
Also, I build the kernels on a system not affected by the problem. It may well
be that a compiler problem in gcc 8.2.0 causes additional trouble, but it is
not the trouble observed on my affected systems.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (171 preceding siblings ...)
  2018-12-03  0:28 ` bugzilla-daemon
@ 2018-12-03  0:59 ` bugzilla-daemon
  2018-12-03  1:03 ` bugzilla-daemon
                   ` (97 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  0:59 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #172 from Michael Orlitzky (michael@orlitzky.com) ---
$ gcc --version
gcc (Gentoo Hardened 7.3.0-r3 p1.4) 7.3.0


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (172 preceding siblings ...)
  2018-12-03  0:59 ` bugzilla-daemon
@ 2018-12-03  1:03 ` bugzilla-daemon
  2018-12-03  1:04 ` bugzilla-daemon
                   ` (96 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  1:03 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Lei Ming (tom.leiming@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tom.leiming@gmail.com

--- Comment #173 from Lei Ming (tom.leiming@gmail.com) ---
The commit 2a5cf35cd6c56b2924 ("block: fix single range discard merge") in
Linus's tree may address one possible source of data loss; anyone who saw
corruption on SCSI may try this fix and see if it makes a difference.

Because the merged discard request isn't removed from the elevator queue, it
might be submitted to the hardware again.
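
Before testing, it helps to know whether a given tree already has that commit. A minimal sketch, assuming the tree is a git checkout that shares history with mainline; `has_commit` is a hypothetical helper name. A stable backport carries a different commit id, so a negative answer on a stable branch is not conclusive.

```shell
#!/bin/sh
# Report whether a commit is part of the history of a given ref.
# has_commit is a hypothetical helper; the commit id is the one quoted above.
has_commit() {
    commit=$1
    ref=${2:-HEAD}
    # merge-base --is-ancestor exits 0 when $commit is reachable from $ref.
    if git merge-base --is-ancestor "$commit" "$ref" 2>/dev/null; then
        echo "$ref contains $commit"
    else
        echo "$ref does not contain $commit (or the commit is unknown here)"
        return 1
    fi
}

# Run from inside a kernel git tree, for example:
#   has_commit 2a5cf35cd6c56b2924 HEAD
```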


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (173 preceding siblings ...)
  2018-12-03  1:03 ` bugzilla-daemon
@ 2018-12-03  1:04 ` bugzilla-daemon
  2018-12-03  1:17 ` bugzilla-daemon
                   ` (95 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  1:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #174 from Theodore Tso (tytso@mit.edu) ---
This is one of the reasons why this bug hunt is so confounding.   While I was
looking at older reports to try to see if I could find common factors, I found
Jimmy's dmesg report in #50, and this one looks different from many of the
others that people have reported.   In this one, the EXT4 errors are preceded
by a USB disconnect followed by disk-level errors.

This is why it's important that we try very hard to filter out false positive
and false negative reports.  We have multiple reports which both strongly
indicate that it's an ext4 bug, and others which strongly indicate it is a bug
below the file system layer.   And then we have ones like this which look like
a USB disconnect....

[52967.931390] usb 4-1: reset SuperSpeed Gen 1 USB device number 2 using
xhci_hcd
[52968.985620] sd 8:0:0:2: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00
driverbyte=0x08
[52968.985624] sd 8:0:0:2: [sdf] tag#0 Sense Key : 0x6 [current] 
[52968.985626] sd 8:0:0:2: [sdf] tag#0 ASC=0x28 ASCQ=0x0 
[52968.985628] sd 8:0:0:2: [sdf] tag#0 CDB: opcode=0x2a 2a 00 00 cc 60 28 00 08
00 00
[52968.985630] print_req_error: I/O error, dev sdf, sector 13393960
[52968.985641] EXT4-fs warning (device sdf2): ext4_end_bio:323: I/O error 10
writing to inode 522 (offset 6132072448 size 6295552 starting block 1674501)
[52968.985643] Buffer I/O error on device sdf2, logical block 1673728
[52968.985651] Buffer I/O error on device sdf2, logical block 1673729
[52968.985654] Buffer I/O error on device sdf2, logical block 1673730
[52968.985659] Buffer I/O error on device sdf2, logical block 1673731
[52968.985663] Buffer I/O error on device sdf2, logical block 1673732
...
[52968.986231] EXT4-fs warning (device sdf2): ext4_end_bio:323: I/O error 10
writing to inode 522 (offset 6132072448 size 8388608 starting block 1675013)
[52969.435367] JBD2: Detected IO errors while flushing file data on sdf2-8
[52969.435407] Aborting journal on device sdf2-8.
[52969.435422] JBD2: Error -5 detected when updating journal superblock for
sdf2-8.
[52969.441997] EXT4-fs error (device sdf2): ext4_journal_check_start:61:
Detected aborted journal
[52985.065239] EXT4-fs error (device sdf2): ext4_remount:5188: Abort forced by
user
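
A rough way to sort reports like this one from the others is to check whether lower-layer errors show up in the log before the first EXT4-fs error. The sketch below is only an illustration: `triage_log` is a hypothetical name, and the patterns are assumptions taken from the log above, not an exhaustive list.

```shell
#!/bin/sh
# Read a dmesg-style log on stdin and say whether block/USB errors show up
# before the first EXT4-fs error. The patterns are illustrative assumptions
# drawn from the log above, not an exhaustive list.
triage_log() {
    awk '
        /EXT4-fs error/ { fs_seen = 1; exit }
        /I\/O error|reset .* USB device|Sense Key/ { lower++ }
        END {
            if (fs_seen && lower > 0)
                print "lower-layer errors precede the first EXT4-fs error"
            else if (fs_seen)
                print "EXT4-fs error with no earlier lower-layer errors"
            else
                print "no EXT4-fs error found"
        }'
}

# Example: dmesg | triage_log
```

A log that lands in the first category is more likely a device or transport problem than a file system bug, although it does not prove either.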


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (174 preceding siblings ...)
  2018-12-03  1:04 ` bugzilla-daemon
@ 2018-12-03  1:17 ` bugzilla-daemon
  2018-12-03  1:23 ` bugzilla-daemon
                   ` (94 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  1:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #175 from Sune Mølgaard (molgaard@gmail.com) ---
I guess I may have been biased towards the posts mentioning the GCC bug, then,
but that would lead me to think that I am not alone in conflating that one with
actual ext4 or block layer bugs.

I shall go ahead and reference my comment above (#169) to the Ubuntu kernel-ppa
folks, and in the event that this will then preclude others from
mis-attributing the GCC bug to these, I should hope to at least effect an
elimination of that noise source from this Bugzilla entry.

My apologies, and keep up the good work!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (175 preceding siblings ...)
  2018-12-03  1:17 ` bugzilla-daemon
@ 2018-12-03  1:23 ` bugzilla-daemon
  2018-12-03  1:37 ` bugzilla-daemon
                   ` (93 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  1:23 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #176 from Theodore Tso (tytso@mit.edu) ---
Hi Sune, alas for your theory in #169, I am already using gcc 8.2.0-9 from
Debian testing.

% gcc --version
gcc (Debian 8.2.0-9) 8.2.0

Could it be an Ubuntu-specific issue?   I don't think so, since there have been
some people running Debian and Gentoo who have reported the problem, and one
person who reported the problem was running Debian and was using gcc 8.2.0-9.

I have built kernels using gcc 8.2.0-9 and used them for regression testing
using gce-xfstests:

% objdump -s --section .comment /build/ext4-64/vmlinux

Contents of section .comment:
 0000 4743433a 20284465 6269616e 20382e32  GCC: (Debian 8.2
 0010 2e302d39 2920382e 322e3000           .0-9) 8.2.0. 

The kernel I am using on my personal development laptop was compiled using gcc
8.2.0-8:

% objdump -s --section .comment
/usr/lib/debug/lib/modules/4.19.0-00022-g831156939ae8/vmlinux  

Contents of section .comment:
 0000 4743433a 20284465 6269616e 20382e32  GCC: (Debian 8.2
 0010 2e302d38 2920382e 322e3000           .0-8) 8.2.0.

Of course, I'm not doing anything more exciting than running chrome, mutt,
emacs, and building kernels most of the time...


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (176 preceding siblings ...)
  2018-12-03  1:23 ` bugzilla-daemon
@ 2018-12-03  1:37 ` bugzilla-daemon
  2018-12-03  1:48 ` bugzilla-daemon
                   ` (92 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  1:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #177 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---
I did a lot of tests here. The first thing I did was to configure tune2fs so
that my three partitions (/boot, the root partition and /home) are forcibly
checked on every boot.

# tune2fs -c 1 /dev/md0
# tune2fs -c 1 /dev/md2
# tune2fs -c 1 /dev/md3

I have reinstalled and compiled the 4.19.5 and 4.19.0 trees from scratch, as
well as the new 4.19.6 tree.
I have not had problems with 4.19.5 or with the new 4.19.6: many hours of
use and a restart every time, everything perfect.
But at the first boot with 4.19.0: corruption of the root partition.
It drops me to the repair console, and I repair it with:

# fsck.ext4 -y /dev/md2

After booting, I look in /lost+found and find many directories and files in
perfect condition, not corrupt, but with numeric '#' names:

# ls -l /lost+found/
total 76
-rw-r--r-- 1 portage portage 5051 dic 10  2013 '#1057825'
drwxr-xr-x 3 portage portage 4096 dic 10  2013 '#1057827'
-rw-r--r-- 1 root    root    2022 oct 22 03:37 '#3184673'
-rw-r--r-- 1 root    root     634 oct 22 03:37 '#3184674'
etc...
etc...

So I decided to start a bisection, downloading the history only from 4.18 onwards.

$ su
# cd /usr/src
# git clone \
    git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git \
    --shallow-exclude v4.17 linux-stable

# eselect kernel list
Available kernel symlink targets:
  [1]   linux-4.18.20-gentoo
  [2]   linux-4.19.0-gentoo
  [3]   linux-4.19.5-gentoo
  [4]   linux-4.19.6-gentoo *
  [5]   linux-stable

# eselect kernel set 5

# eselect kernel list 
Available kernel symlink targets:
  [1]   linux-4.18.20-gentoo
  [2]   linux-4.19.0-gentoo
  [3]   linux-4.19.5-gentoo
  [4]   linux-4.19.6-gentoo
  [5]   linux-stable *

# ls -l
total 20
lrwxrwxrwx  1 root root   12 dic  2 21:27 linux -> linux-stable
drwxr-xr-x 27 root root 4096 nov 24 14:44 linux-4.18.20-gentoo
drwxr-xr-x 27 root root 4096 dic  2 20:28 linux-4.19.0-gentoo
drwxr-xr-x 27 root root 4096 dic  2 03:47 linux-4.19.5-gentoo
drwxr-xr-x 27 root root 4096 dic  2 14:50 linux-4.19.6-gentoo
drwxr-xr-x 26 root root 4096 dic  2 19:18 linux-stable

# cd linux
# git bisect start v4.19 v4.18 -- fs/ext4
Bisecting: 16 revisions left to test after this (roughly 4 steps)
[863c37fcb14f8b66ea831b45fb35a53ac4a8d69e] ext4: remove unneeded variable "err"
in ext4_mb_release_inode_pa()

# git bisect log
# bad: [84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d] Linux 4.19
# good: [94710cac0ef4ee177a63b5227664b38c95bbf703] Linux 4.18
git bisect start 'v4.19' 'v4.18' '--' 'fs/ext4'
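
The mechanics of a path-limited bisect can be tried on a throwaway repository first. This is a sketch only: `bisect_demo`, the file name and the commit messages are all made up, and `git bisect run` stands in for the build, boot and test cycle that a kernel bisect needs at every step (there, each step ends with a manual `git bisect good` or `git bisect bad`). Limiting the bisect to fs/ext4 can only find a culprit that lives under that path.

```shell
#!/bin/sh
# Throwaway demonstration of a path-limited bisect; nothing here touches a
# real kernel tree. All file names and commit messages are made up.
bisect_demo() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.name demo
        git config user.email demo@example.com
        mkdir -p fs/ext4
        for n in 1 2 3 4 5; do
            # Commit 3 introduces the "regression" that later commits keep.
            if [ "$n" -ge 3 ]; then state=broken; else state=ok; fi
            echo "$state $n" > fs/ext4/demo.c
            git add fs/ext4/demo.c
            git commit -q -m "ext4: demo change $n"
        done
        first=$(git rev-list --max-parents=0 HEAD)
        # Same form as above: bad revision first, then good, then the path.
        git bisect start HEAD "$first" -- fs/ext4 >/dev/null
        # The test command exits 0 for good and 1 for bad revisions.
        git bisect run sh -c '! grep -q broken fs/ext4/demo.c'
    )
    rm -rf "$repo"
}

# Example: bisect_demo
```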

I am just getting started; today was Sunday and, besides, I have little
experience with git :)
I was also looking at the ebuilds of the gentoo-sources trees to see which
patches emerge applies when installing the sources.

$ cat /usr/portage/sys-kernel/gentoo-sources/gentoo-sources-4.18.20.ebuild
|grep K_GENPATCHES_VER=
K_GENPATCHES_VER="24"
$ ls -lh /usr/portage/distfiles/genpatches-4.18-24.base.tar.xz 
-rw-rw-r-- 1 portage portage 661K nov 21 10:13
/usr/portage/distfiles/genpatches-4.18-24.base.tar.xz
$ tar -tf /usr/portage/distfiles/genpatches-4.18-24.base.tar.xz 
./0000_README
./1000_linux-4.18.1.patch
./1001_linux-4.18.2.patch
./1002_linux-4.18.3.patch
./1003_linux-4.18.4.patch
./1004_linux-4.18.5.patch
./1005_linux-4.18.6.patch
./1006_linux-4.18.7.patch
./1007_linux-4.18.8.patch
./1008_linux-4.18.9.patch
./1009_linux-4.18.10.patch
./1010_linux-4.18.11.patch
./1011_linux-4.18.12.patch
./1012_linux-4.18.13.patch
./1013_linux-4.18.14.patch
./1014_linux-4.18.15.patch
./1015_linux-4.18.16.patch
./1016_linux-4.18.17.patch
./1017_linux-4.18.18.patch
./1018_linux-4.18.19.patch
./1019_linux-4.18.20.patch
./1500_XATTR_USER_PREFIX.patch
./1510_fs-enable-link-security-restrictions-by-default.patch
./2500_usb-storage-Disable-UAS-on-JMicron-SATA-enclosure.patch
./2600_enable-key-swapping-for-apple-mac.patch

$ tar -xf /usr/portage/distfiles/genpatches-4.18-24.base.tar.xz
./1019_linux-4.18.20.patch
$ ls -lh 1019_linux-4.18.20.patch
-rw-r--r-- 1 nestor nestor 164K nov 21 10:01 1019_linux-4.18.20.patch

$ cat /usr/portage/sys-kernel/gentoo-sources/gentoo-sources-4.19.0.ebuild |grep
K_GENPATCHES_VER=
K_GENPATCHES_VER="1"
$ ls -lh /usr/portage/distfiles/genpatches-4.19-1.base.tar.xz
-rw-rw-r-- 1 portage portage 4,0K oct 22 08:47
/usr/portage/distfiles/genpatches-4.19-1.base.tar.xz
$ tar -tf /usr/portage/distfiles/genpatches-4.19-1.base.tar.xz
./0000_README
./1500_XATTR_USER_PREFIX.patch
./1510_fs-enable-link-security-restrictions-by-default.patch
./2500_usb-storage-Disable-UAS-on-JMicron-SATA-enclosure.patch
./2600_enable-key-swapping-for-apple-mac.patch

As you can see, the 4.19.0 tree does not apply any 1000_linux-4.19.x.patch
patches.

My gcc version for quite some time:
$ gcc -v
gcc version 8.2.0 (Gentoo 8.2.0-r5 p1.6)

Obviously something is happening to the inodes, but apparently I only see it
now with the 4.19.0 tree.
If I find something I will report it.

Regards


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (177 preceding siblings ...)
  2018-12-03  1:37 ` bugzilla-daemon
@ 2018-12-03  1:48 ` bugzilla-daemon
  2018-12-03  1:50 ` bugzilla-daemon
                   ` (91 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  1:48 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #178 from Sune Mølgaard (molgaard@gmail.com) ---
Hi Theodore,

I am not much of a kernel developer, let alone an FS one, so your guesses
would be vastly better founded than mine.

I could imagine, though, that a combination of GCC version, .config and,
possibly, the creation time (kernel version-wise) of the FSs in question, could
create a sort of "cocktail effect". For my part, none of my FSs are less than
a year old.

FWIW, I started seeing the problem specifically with 4.19.3 (4.19.0 being good,
and built with 8.2.0-7), but that was after skipping 4.19.[12].

I note that the first Ubuntu kernel-ppa kernel to be built with 8.2.0-9 was
4.19.1, so if my ongoing bisect ends without any triggering of the bug I see, I
shall try kernel-ppa 4.19.1 - if that exhibits the bug, then that further
points to GCC, but as you say, perhaps specifically for the Ubuntu kernels.

Now, as someone else stated somewhere, the only things that kernel-ppa patches
are some Ubuntu-specific build and package structure, as well as .config, the
last part being available at
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19.1/0006-configs-based-on-Ubuntu-4.19.0-4.5.patch
.

As promised above, I have written to the kernel-ppa team lead, Brad Figg, and I
would expect him and his team to be better at pinpointing which combination of
GCC 8.2.0-9 and .config options might be problematic, but if they find that the
problem goes away with a GCC upgrade, they might opt for letting that be it.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (178 preceding siblings ...)
  2018-12-03  1:48 ` bugzilla-daemon
@ 2018-12-03  1:50 ` bugzilla-daemon
  2018-12-03  2:31 ` bugzilla-daemon
                   ` (90 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  1:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #179 from Theodore Tso (tytso@mit.edu) ---
Michael Orlitzky: In your report, you've indicated that you've only been seeing
bugs in files that are being *read* and that these were files that were written
long ago.    If you reboot, or drop caches using "echo 3 >
/proc/sys/vm/drop_caches" do the files stay corrupted?    Some of the reports
(but not others) seem to indicate the problem is happening on read, not on
write.  Of course, some of the reports are relating to metadata blocks getting
corrupted on read, while your report is about data blocks getting corrupted on
read.
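
A minimal sketch of that check, for anyone who wants to script it. `recheck_after_drop` and `DROP_CMD` are names made up for this sketch; the default drop command is the one quoted above and needs root.

```shell
#!/bin/sh
# Checksum a file, drop the kernel caches, and checksum it again.
# DROP_CMD is an assumption made for this sketch so the cache-dropping step
# can be replaced; the default needs root.
: "${DROP_CMD:=sync; echo 3 > /proc/sys/vm/drop_caches}"

recheck_after_drop() {
    file=$1
    before=$(sha256sum < "$file") || return 2
    sh -c "$DROP_CMD" || return 2
    after=$(sha256sum < "$file") || return 2
    if [ "$before" = "$after" ]; then
        echo "same checksum before and after dropping caches"
    else
        echo "checksum changed after dropping caches"
        return 1
    fi
}

# Example (as root): recheck_after_drop /lib/x86_64-linux-gnu/libattr.so.1
```

A checksum that changes suggests the bad data lived only in memory. A file that is known to be corrupt and keeps the same checksum suggests the damage is on disk. Neither result is proof on its own.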

Thanks!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (179 preceding siblings ...)
  2018-12-03  1:50 ` bugzilla-daemon
@ 2018-12-03  2:31 ` bugzilla-daemon
  2018-12-03  2:43 ` bugzilla-daemon
                   ` (89 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  2:31 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #180 from Eric Benoit (eric@ecks.ca) ---
I've been able to reproduce the issue on the other workstation I'd mentioned
earlier with ZFS:

NAME                        STATE     READ WRITE CKSUM
nipigon                     ONLINE       0     0     0 
  mirror-0                  ONLINE       0     0     0 
    wwn-0x5000c5006407e87e  ONLINE       0     0    17 
    wwn-0x5000c5004e4e92a9  ONLINE       0     0     6 

This isn't particularly hard to trigger; just a bunch of files filled with
/dev/urandom (64M*14000) being read back concurrently (eight processes) over
about 90 minutes.

Kernel version is 4.19.6 compiled with gcc 8.2.0 (Gentoo 8.2.0-r5 p1.6).

Have we any other ZFS users experiencing this?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (180 preceding siblings ...)
  2018-12-03  2:31 ` bugzilla-daemon
@ 2018-12-03  2:43 ` bugzilla-daemon
  2018-12-03  2:53 ` bugzilla-daemon
                   ` (88 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  2:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #181 from Guenter Roeck (linux@roeck-us.net) ---
#180: Eric, would you mind sharing the script used to create the files and to
read them back ?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (181 preceding siblings ...)
  2018-12-03  2:43 ` bugzilla-daemon
@ 2018-12-03  2:53 ` bugzilla-daemon
  2018-12-03  3:00 ` bugzilla-daemon
                   ` (87 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  2:53 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #182 from Guenter Roeck (linux@roeck-us.net) ---
As for where problems are seen, for my part the problem is seen mostly when
trying to read files created recently as part of kernel builds. The problems
are reported with reads, but writes are definitely involved, at least for me.

As for blaming gcc, or Ubuntu, or both, I would kindly like to remind people
that I see the problem on two systems out of four running v4.19.x kernels, all
with the same kernel build and configuration.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (182 preceding siblings ...)
  2018-12-03  2:53 ` bugzilla-daemon
@ 2018-12-03  3:00 ` bugzilla-daemon
  2018-12-03  3:08 ` bugzilla-daemon
                   ` (86 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  3:00 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #183 from Theodore Tso (tytso@mit.edu) ---
Eric, re #180, could you upload your .config file for your kernel and the
boot-command line?   I'm interested in particular what I/O scheduler you are
using.   And are you using the same .config and boot command line (and other
system configurations) on your other system where you were seeing the problem? 
Many thanks!!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (183 preceding siblings ...)
  2018-12-03  3:00 ` bugzilla-daemon
@ 2018-12-03  3:08 ` bugzilla-daemon
  2018-12-03  3:34 ` bugzilla-daemon
                   ` (85 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  3:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #184 from Lei Ming (tom.leiming@gmail.com) ---
The commit 2a5cf35cd6c56b2924 ("block: fix single range discard merge") in
Linus's tree may address one possible source of data loss; anyone who saw
corruption on SCSI may try this fix and see if it makes a difference.

Because the merged discard request isn't removed from the elevator queue, it
might be submitted to the hardware again.

(In reply to Theodore Tso from comment #174)
> This is one of the reasons why this bug hunt is so confounding.   While I was
> looking at older reports to try to see if I could find common factors, I
> found Jimmy's dmesg report in #50, and this one looks different from many of
> the others that people have reported.   In this one, the EXT4 errors are
> preceded by a USB disconnect followed by disk-level errors.
> 
> This is why it's important that we try very hard to filter out false
> positive and false negative reports.  We have multiple reports which both
> strongly indicate that it's an ext4 bug, and others which strongly indicate
> it is a bug below the file system layer.   And then we have ones like this
> which look like a USB disconnect....
> 
> [52967.931390] usb 4-1: reset SuperSpeed Gen 1 USB device number 2 using
> xhci_hcd

IMO this is a USB device reset rather than a disconnect, and a reset is often
triggered by SCSI error handling (EH).

Thanks,


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (184 preceding siblings ...)
  2018-12-03  3:08 ` bugzilla-daemon
@ 2018-12-03  3:34 ` bugzilla-daemon
  2018-12-03  3:40 ` bugzilla-daemon
                   ` (84 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  3:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #185 from Eric Benoit (eric@ecks.ca) ---
(In reply to Guenter Roeck from comment #181)
> #180: Eric, would you mind sharing the script used to create the files and
> to read them back ?

Just a pair of trivial one-liners:

for i in {00000..13999}; do echo dd bs=1M count=64 if=/dev/urandom of=urand.$i;
done

for i in urand.*; do echo dd bs=1M if=$i of=/dev/null; done | parallel -j8

I'm using /dev/urandom since I have lz4 compression enabled. I imagine
/dev/zero would be just as effective if you don't.
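
For a file system without its own checksums, a variant that records checksums at write time makes a mismatch visible. This is a scaled-down sketch, not the test that was run above: `write_and_verify`, `COUNT`, `SIZE_MB` and `sums.txt` are names assumed for illustration, and it reads back serially rather than with eight parallel readers.

```shell
#!/bin/sh
# Scaled-down variant of the write/read-back test above that also records
# checksums, so a mismatch can be pinned to a file. COUNT, SIZE_MB and the
# sums.txt name are assumptions for this sketch; the run described above
# used 14000 files of 64M each.
write_and_verify() {
    dir=$1
    count=${COUNT:-8}
    size_mb=${SIZE_MB:-1}
    (
        cd "$dir" || exit 1
        i=0
        while [ "$i" -lt "$count" ]; do
            name=$(printf 'urand.%05d' "$i")
            dd bs=1M count="$size_mb" if=/dev/urandom of="$name" 2>/dev/null || exit 1
            i=$((i + 1))
        done
        sha256sum urand.* > sums.txt || exit 1
        # For a real test, drop caches or remount here so the read-back
        # comes from the device rather than from memory.
        sha256sum --quiet -c sums.txt
    )
}

# Example: COUNT=14000 SIZE_MB=64 write_and_verify /mnt/test
```

The function exits non-zero and names the failing files when any checksum no longer matches.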


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (185 preceding siblings ...)
  2018-12-03  3:34 ` bugzilla-daemon
@ 2018-12-03  3:40 ` bugzilla-daemon
  2018-12-03  3:51 ` bugzilla-daemon
                   ` (83 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  3:40 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #186 from Eric Benoit (eric@ecks.ca) ---
Created attachment 279807
  --> https://bugzilla.kernel.org/attachment.cgi?id=279807&action=edit
tecciztecatl linux kernel 4.19.6 .config


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (186 preceding siblings ...)
  2018-12-03  3:40 ` bugzilla-daemon
@ 2018-12-03  3:51 ` bugzilla-daemon
  2018-12-03  4:06 ` bugzilla-daemon
                   ` (82 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  3:51 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #187 from Eric Benoit (eric@ecks.ca) ---
(In reply to Theodore Tso from comment #183)
> Eric, re #180, could you upload your .config file for your kernel and the
> boot-command line?   I'm interested in particular what I/O scheduler you are
> using.   And are you using the same .config and boot command line (and other
> system configurations) on your other system where you were seeing the
> problem?  Many thanks!!

[    0.000000] Command line: BOOT_IMAGE=/root@/boot/vmlinuz root=simcoe/root
triggers=zfs radeon.dpm=1

# cat /sys/block/sd[a-d]/queue/scheduler    
[none] 
[none] 
[none] 
[none] 

The configs for 4.18.20 and 4.19.6 are as close to identical as possible; the
only differences are the options that were added in 4.19 and prompted for by
make oldconfig.

Between this machine and the other (a server) the only differences would be in
specific hardware support and options suitable for that application. In terms
of schedulers, block devices, and filesystem support, they're the same.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (187 preceding siblings ...)
  2018-12-03  3:51 ` bugzilla-daemon
@ 2018-12-03  4:06 ` bugzilla-daemon
  2018-12-03  4:31 ` bugzilla-daemon
                   ` (81 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  4:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #188 from Eric Benoit (eric@ecks.ca) ---
(In reply to Eric Benoit from comment #185)
> for i in {00000..13999}; do echo dd bs=1M count=64 if=/dev/urandom
> of=urand.$i; done

Er whoops, might want to remove the echo or pipe it to parallel.

I'm repeating things under 4.18.20 just for comparison. It's been about an hour
now without a single checksum error reported.
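
For anyone repeating this on a file system that does not verify data checksums
on read, the corrected loops from #185 could be sketched roughly as below. The
helper names write_files and verify_files and the MANIFEST file are
hypothetical, and the SHA-256 manifest is an addition that was not part of the
original one-liners; xargs -P stands in for the "parallel -j8" used above.

```shell
# Sketch of the #185 workload with the stray "echo" removed.
# write_files / verify_files are hypothetical helper names, not from the thread.

# write_files DIR COUNT SIZE_MIB
# Create COUNT files of random data in DIR and record their SHA-256 sums.
write_files() {
    local dir=$1 count=$2 size_mib=$3 i=0 name
    mkdir -p "$dir" || return 1
    : > "$dir/MANIFEST"
    while [ "$i" -lt "$count" ]; do
        name=$(printf 'urand.%05d' "$i")
        dd bs=1M count="$size_mib" if=/dev/urandom of="$dir/$name" 2>/dev/null || return 1
        (cd "$dir" && sha256sum "$name") >> "$dir/MANIFEST" || return 1
        i=$((i + 1))
    done
}

# verify_files DIR [JOBS]
# Re-read every file with JOBS parallel readers and compare the fresh
# checksums with the manifest. Returns non-zero if anything differs.
verify_files() {
    local dir=$1 jobs=${2:-8}
    (cd "$dir" &&
        sed 's/^[^ ]*  //' MANIFEST |
            xargs -P "$jobs" -n 1 sha256sum | sort > MANIFEST.reread &&
        sort MANIFEST | diff - MANIFEST.reread)
}
```

With the numbers from #185 this would be called as "write_files . 14000 64"
followed by "verify_files . 8". The re-read only exercises the disk if the data
is no longer in the page cache, so the data set has to exceed RAM or the caches
have to be dropped between the two steps.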


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (188 preceding siblings ...)
  2018-12-03  4:06 ` bugzilla-daemon
@ 2018-12-03  4:31 ` bugzilla-daemon
  2018-12-03  8:33 ` bugzilla-daemon
                   ` (80 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  4:31 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #189 from Michael Orlitzky (michael@orlitzky.com) ---
(In reply to Theodore Tso from comment #179)
> Michael Orlitzky: In your report, you've indicated that you've only been
> seeing bugs in files that are being *read* and that these were files that
> were written long ago.    If you reboot, or drop caches using "echo 3 >
> /proc/sys/vm/drop_caches" do the files stay corrupted? 

Each time the corruption has been reported by the backup job that I run
overnight. When I see the failed report in the morning, I reboot into
SystemRescueCD (which is running 4.14.x) and then run fsck to fix things. The
fsck does indeed find a bunch of corruption, and appears to fix it.

The first couple of times I verified the corruption by running something like
"git gc" in the affected directory, and IIRC I got the same "structure needs
cleaning" error back. Before that, I hadn't touched that repo in a while. But
since then, I've just been rebooting immediately and running fsck -- each time
finding something wrong and (I hope) correcting it.

It takes about a week for the corruption to show up, but if there's some test
you need me to run I can boot back into 4.19.6 and roll the dice.
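
The drop_caches check asked about in #179 could be run along the following
lines. This is only a sketch: the helper name recheck_after_drop is
hypothetical, it compares file data only (it will not catch the metadata errors
that fsck reports), and dropping caches needs root.

```shell
# recheck_after_drop FILE
# Checksum FILE, drop the page cache, checksum it again.
# Prints "same" if both reads agree and "changed" if they do not.
# recheck_after_drop is a hypothetical helper name.
recheck_after_drop() {
    local file=$1 before after
    before=$(sha256sum < "$file") || return 2
    sync
    # Writing 3 drops the page cache plus dentries and inodes (root only).
    if ! { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null; then
        echo "warning: could not drop caches; second read may be cached" >&2
    fi
    after=$(sha256sum < "$file") || return 2
    if [ "$before" = "$after" ]; then
        echo "same"
    else
        echo "changed"
    fi
}
```

If a file that reads back corrupted prints "same", the bad data is presumably
on the media; "changed" would suggest that only the cached copy was bad.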


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (189 preceding siblings ...)
  2018-12-03  4:31 ` bugzilla-daemon
@ 2018-12-03  8:33 ` bugzilla-daemon
  2018-12-03  9:24 ` bugzilla-daemon
                   ` (79 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  8:33 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #190 from Andreas John (himself@derjohn.de) ---
(In reply to Michael Orlitzky from comment #189)
> It takes about a week for the corruption to show up, but if there's some
> test you need me to run I can boot back into 4.19.6 and roll the dice.

Hm, interesting that it happens on my Ubuntu Bionic on a bare-metal MacBook Pro
(SSD) within minutes, even if I don't have much I/O load. I am doing the fsck
with 4.18.20. What could be the difference?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (190 preceding siblings ...)
  2018-12-03  8:33 ` bugzilla-daemon
@ 2018-12-03  9:24 ` bugzilla-daemon
  2018-12-03 10:42 ` bugzilla-daemon
                   ` (78 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03  9:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #191 from Marc Burkhardt (marc@osknowledge.org) ---
(In reply to Eric Benoit from comment #185)
> (In reply to Guenter Roeck from comment #181)
> > #180: Eric, would you mind sharing the script used to create the files and
> > to read them back ?
> 
> Just a pair of trivial one-liners:
> 
> for i in {00000..13999}; do echo dd bs=1M count=64 if=/dev/urandom
> of=urand.$i; done
> 
> for i in urand.*; do echo dd bs=1M if=$i of=/dev/null; done | parallel -j8
> 
> I'm using /dev/urandom since I have lz4 compression enabled. I imagine
> /dev/zero would be just as effective if you don't.

I don't know if my comments are relevant, as I have had no reply so far, but
here is some info regarding this test:

I ran it without errors on my /home partition, which is a dm-crypt ext4 setup
using the deadline mq-scheduler and the gcc 8 compiler branch. The partition is
mounted as follows:

/dev/mapper/crypt-home on /home type ext4 (rw,nosuid,noatime,nodiratime,quota,usrquota,grpquota,errors=remount-ro)


[    0.000000] Linux version 4.19.6loc64 (marc@marc) (gcc version 8.2.0 (Gentoo
Hardened 8.2.0-r4 p1.5)) #1 SMP PREEMPT Sat Dec 1 16:00:21 CET 2018
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.19.6loc64 root=/dev/sda7 ro
init=/sbin/openrc-init root=PARTUUID=6d19e60a-72a8-ee44-89f4-cc6f85a9436c
real_root=/dev/sda7 ro resume=PARTUUID=fbc25a25-2d09-634d-9e8b-67308f2feddf
real_resume=/dev/sda8 acpi_osi=Linux libata.dma=3 libata.noacpi=0 threadirqs
rootfstype=ext4 acpi_sleep=s3_bios,s3_beep devtmpfs.mount=0 net.ifnames=0
vmalloc=512M noautogroup elevator=deadline libata.force=noncq nmi_watchdog=0
i915.modeset=0 cgroup_disable=memory scsi_mod.use_blk_mq=y dm_mod.use_blk_mq=y
vgacon.scrollback_persistent=1 processor.ignore_ppc=1 intel_iommu=igfx_off
crashkernel=128M apparmor=1 security=apparmor nouveau.noaccel=0
nouveau.nofbaccel=1 nouveau.modeset=1 nouveau.runpm=0
nouveau.debug=disp=trace,i2c=trace,bios=trace nouveau.config=NvPmShowAll=true
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel


Might be worth getting this guy on board - got no reply though.

https://www.phoronix.com/forums/forum/software/general-linux-open-source/1063976-some-users-have-been-hitting-ext4-file-system-corruption-on-linux-4-19?p=1064826#post1064826


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (191 preceding siblings ...)
  2018-12-03  9:24 ` bugzilla-daemon
@ 2018-12-03 10:42 ` bugzilla-daemon
  2018-12-03 10:46 ` bugzilla-daemon
                   ` (77 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 10:42 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #192 from Rainer Fiebig (jrf@mailbox.org) ---
#136

It has been suggested that I/O-schedulers may play a role in this. So here
are my settings for 4.19.x for comparison. They deviate from yours in some
points but I really don't know whether this has any relevance. You may want to
give it a try anyway. As I've said, 4.19.x is a nice kernel here.

> grep -i sched .config_4.19-rc5 
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_SCHED_HRTICK=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
# IO Schedulers
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
CONFIG_DEFAULT_IOSCHED="deadline"
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
CONFIG_NET_SCHED=y
# Queueing/Scheduling
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_SCHED_INFO=y
CONFIG_SCHED_TRACER=y


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (192 preceding siblings ...)
  2018-12-03 10:42 ` bugzilla-daemon
@ 2018-12-03 10:46 ` bugzilla-daemon
  2018-12-03 11:02 ` bugzilla-daemon
                   ` (76 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 10:46 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #193 from Ortwin Glück (odi@odi.ch) ---
I am desperately trying to reproduce this in a qemu/KVM virtual machine with
the configs given by those users. But until now to no avail.

If anybody has seen this issue in a VM please share your .config, mount options
of all filesystems, kernel command line, and possibly workload that you are
running.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (193 preceding siblings ...)
  2018-12-03 10:46 ` bugzilla-daemon
@ 2018-12-03 11:02 ` bugzilla-daemon
  2018-12-03 11:03 ` bugzilla-daemon
                   ` (75 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 11:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #194 from Marc Burkhardt (marc@osknowledge.org) ---
(In reply to Rainer Fiebig from comment #192)
> #136
> 
> It has been suggested that I/O-schedulers may play a role in this. So here
> are my settings for 4.19.x for comparison. They deviate from yours in some
> points but I really don't know whether this has any relevance. You may want
> to give it a try anyway. As I've said, 4.19.x is a nice kernel here.
> 
> > grep -i sched .config_4.19-rc5 
> CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
> CONFIG_CGROUP_SCHED=y
> CONFIG_FAIR_GROUP_SCHED=y
> CONFIG_RT_GROUP_SCHED=y
> CONFIG_SCHED_AUTOGROUP=y
> CONFIG_SCHED_OMIT_FRAME_POINTER=y
> CONFIG_SCHED_SMT=y
> CONFIG_SCHED_MC=y
> CONFIG_SCHED_MC_PRIO=y
> CONFIG_SCHED_HRTICK=y
> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
> # IO Schedulers
> CONFIG_IOSCHED_NOOP=y
> CONFIG_IOSCHED_DEADLINE=y
> CONFIG_IOSCHED_CFQ=y
> CONFIG_CFQ_GROUP_IOSCHED=y
> CONFIG_DEFAULT_IOSCHED="deadline"
> CONFIG_MQ_IOSCHED_DEADLINE=y
> CONFIG_MQ_IOSCHED_KYBER=y
> # CONFIG_IOSCHED_BFQ is not set
> CONFIG_NET_SCHED=y
> # Queueing/Scheduling
> CONFIG_USB_EHCI_TT_NEWSCHED=y
> CONFIG_SCHED_INFO=y
> CONFIG_SCHED_TRACER=y

Really, how come you say "these are your settings"?

The settings are what is actually being used, not what has been compiled in, or
am I missing something?

What's the connection between 

CONFIG_DEFAULT_IOSCHED="deadline" + CONFIG_IOSCHED_DEADLINE=y

and

cat /sys/block/sda/queue/scheduler 
mq-deadline [kyber] bfq none

Please see #139 - we need a list of what is effectively used and not what is
actually possible. Bare metal or not. Intel? AMD? hugepages or not?
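
To collect what is effectively used, the active scheduler is the entry in
square brackets in /sys/block/<dev>/queue/scheduler. A small sketch for listing
it per device follows; the function names active_scheduler and
list_active_schedulers are made up for this example.

```shell
# active_scheduler LINE
# Print the bracketed (currently active) entry from a line in the format of
# /sys/block/<dev>/queue/scheduler, e.g. "mq-deadline [kyber] bfq none".
# Prints nothing if the line contains no bracketed entry.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# list_active_schedulers [SYSFS_BLOCK_DIR]
# Print one "device scheduler" pair per block device.
# Both function names are hypothetical helpers for illustration.
list_active_schedulers() {
    local base=${1:-/sys/block} f dev
    for f in "$base"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#"$base"/}   # strip the leading directory
        dev=${dev%%/*}      # keep only the device name
        printf '%s %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Running list_active_schedulers without arguments on an affected or unaffected
machine gives output that can be pasted here next to the kernel version and
command line.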


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (194 preceding siblings ...)
  2018-12-03 11:02 ` bugzilla-daemon
@ 2018-12-03 11:03 ` bugzilla-daemon
  2018-12-03 11:08 ` bugzilla-daemon
                   ` (74 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 11:03 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #195 from Rainer Fiebig (jrf@mailbox.org) ---
#193
> I am desperately trying to reproduce this in a qemu/KVM virtual machine with
> the configs given by those users. But until now to no avail.

Good luck, Mr. Glück! ;)
My VirtualBox-VM seems immune to this issue. Perhaps VMs have just the right
"hardware".


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (195 preceding siblings ...)
  2018-12-03 11:03 ` bugzilla-daemon
@ 2018-12-03 11:08 ` bugzilla-daemon
  2018-12-03 11:09 ` bugzilla-daemon
                   ` (73 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 11:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #196 from Rainer Fiebig (jrf@mailbox.org) ---
#194

Hair-splitting won't help in this matter. 

And btw: if you're so smart - how come you haven't solved this already?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (196 preceding siblings ...)
  2018-12-03 11:08 ` bugzilla-daemon
@ 2018-12-03 11:09 ` bugzilla-daemon
  2018-12-03 14:18 ` bugzilla-daemon
                   ` (72 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 11:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #197 from Marc Burkhardt (marc@osknowledge.org) ---
(In reply to Marc Burkhardt from comment #194)
> (In reply to Rainer Fiebig from comment #192)
> > #136
> > 
> > It has been suggested that I/O-schedulers may play a role in this. So here
> > are my settings for 4.19.x for comparison. They deviate from yours in some
> > points but I really don't know whether this has any relevance. You may want
> > to give it a try anyway. As I've said, 4.19.x is a nice kernel here.
> > 
> > > grep -i sched .config_4.19-rc5 
> > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
> > CONFIG_CGROUP_SCHED=y
> > CONFIG_FAIR_GROUP_SCHED=y
> > CONFIG_RT_GROUP_SCHED=y
> > CONFIG_SCHED_AUTOGROUP=y
> > CONFIG_SCHED_OMIT_FRAME_POINTER=y
> > CONFIG_SCHED_SMT=y
> > CONFIG_SCHED_MC=y
> > CONFIG_SCHED_MC_PRIO=y
> > CONFIG_SCHED_HRTICK=y
> > # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> > # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
> > # IO Schedulers
> > CONFIG_IOSCHED_NOOP=y
> > CONFIG_IOSCHED_DEADLINE=y
> > CONFIG_IOSCHED_CFQ=y
> > CONFIG_CFQ_GROUP_IOSCHED=y
> > CONFIG_DEFAULT_IOSCHED="deadline"
> > CONFIG_MQ_IOSCHED_DEADLINE=y
> > CONFIG_MQ_IOSCHED_KYBER=y
> > # CONFIG_IOSCHED_BFQ is not set
> > CONFIG_NET_SCHED=y
> > # Queueing/Scheduling
> > CONFIG_USB_EHCI_TT_NEWSCHED=y
> > CONFIG_SCHED_INFO=y
> > CONFIG_SCHED_TRACER=y
> 
> Really, how come you say "these are your settings"?
> 
> The settings are what is actually being used, not what has been compiled in,
> or am I missing something?
> 
> What's the connection between 
> 
> CONFIG_DEFAULT_IOSCHED="deadline" + CONFIG_IOSCHED_DEADLINE=y
> 
> and
> 
> cat /sys/block/sda/queue/scheduler 
> mq-deadline [kyber] bfq none
> 
> Please see #139 - we need a list of what is effectively used and not what
> is actually possible. Bare metal or not. Intel? AMD? hugepages or not?

I use an allegedly wrong compiler, I use 4.19.y, I use ext4 with and without
dm-crypt, I use a scheduler, ... and I am currently NOT affected by that bug,
even when running the tests that people say trigger it.

Just to make it clear again: I'm not a kernel dev, ok, but I have used Linux for
a long time and I'm willing to help work out which setups are *not* affected.
Maybe I am doing something totally wrong here, but I'm willing to help out as a
give-back to the community that provides the OS I have used exclusively for 20+
years.

I think the discussion should (at this point) not revolve around what your
kernel is *capable* of, but just around what is actually set up to trigger the
bug.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (197 preceding siblings ...)
  2018-12-03 11:09 ` bugzilla-daemon
@ 2018-12-03 14:18 ` bugzilla-daemon
  2018-12-03 14:20 ` bugzilla-daemon
                   ` (71 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 14:18 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Sebastian Jastrzebski (shopper2k@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |shopper2k@gmail.com

--- Comment #198 from Sebastian Jastrzebski (shopper2k@gmail.com) ---
I can also confirm the fs corruption issue on Fedora 29 with the 4.19.5 kernel.
I run it on a ThinkPad T480 with a Samsung NVMe drive.

* Workload

The workload involves doing a bunch of compile sessions and/or running a VM
(under the KVM hypervisor) with an NFS server. It usually takes anywhere from a
few hours to a day for the corruption to occur.

* Symptoms

- /dev/nvme0n1* entries disappear from /dev/
- unable to start any program as I get I/O errors

* System Info

> uname -a
Linux skyline.origin 4.19.5-300.fc29.x86_64 #1 SMP Tue Nov 27 19:29:23 UTC 2018
x86_64 x86_64 x86_64 GNU/Linux

> cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.19.5-300.fc29.x86_64 root=/dev/mapper/fedora_skyline-root
ro rd.lvm.lv=fedora_skyline/root
rd.luks.uuid=luks-b66e85a5-f7b1-4d87-8fab-a01687e35056
rd.lvm.lv=fedora_skyline/swap rhgb quiet LANG=en_US.UTF-8

> cat /sys/block/nvme0n1/queue/scheduler 
[none] mq-deadline

> lsblk 
NAME                                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1                                       259:0    0 238.5G  0 disk  
├─nvme0n1p1                                   259:1    0   200M  0 part  /boot/efi
├─nvme0n1p2                                   259:2    0     1G  0 part  /boot
├─nvme0n1p3                                   259:3    0   160G  0 part  
│ └─luks-b66e85a5-f7b1-4d87-8fab-a01687e35056 253:0    0   160G  0 crypt 
│   ├─fedora_skyline-root                     253:1    0   156G  0 lvm   /
│   └─fedora_skyline-swap                     253:2    0     4G  0 lvm   [SWAP]
└─nvme0n1p4                                   259:4    0  77.3G  0 part  
  ├─skyline_vms-atomic_00                     253:3    0    20G  0 lvm   
  └─skyline_vms-win10_00                      253:4    0    40G  0 lvm   

This is dumpe2fs output on the currently booted system.

> dumpe2fs /dev/mapper/fedora_skyline-root
dumpe2fs 1.44.3 (10-July-2018)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          410261f3-0779-455b-9642-d52800292fd7
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              10223616
Block count:              40894464
Reserved block count:     2044723
Free blocks:              26175785
Free inodes:              9255977
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      1024
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Feb 19 18:48:05 2018
Last mount time:          Mon Dec  3 08:07:30 2018
Last write time:          Mon Dec  3 03:07:29 2018
Mount count:              137
Maximum mount count:      -1
Last checked:             Sat Jul 14 07:11:08 2018
Check interval:           0 (<none>)
Lifetime writes:          1889 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
First orphan inode:       9318809
Default directory hash:   half_md4
Directory Hash Seed:      ad5a6f9c-6250-4dc5-84d9-4a3b14edc7b7
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke journal_64bit
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x00508e50
Journal start:            1


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (198 preceding siblings ...)
  2018-12-03 14:18 ` bugzilla-daemon
@ 2018-12-03 14:20 ` bugzilla-daemon
  2018-12-03 14:57 ` bugzilla-daemon
                   ` (70 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 14:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #199 from Ortwin Glück (odi@odi.ch) ---
(In reply to Rainer Fiebig from comment #195)
> My VirtualBox-VM seems immune to this issue. Perhaps VMs have just the right
> "hardware".

Yeah, I've been wondering. Qemu only exposes single-queue devices (virtio_blk)
so bugs in MQ can not trigger here I guess. Also "hardware" timing is much
different in VMs so race conditions may not trigger with the same frequency.
Also CPU assignment/scheduling may be different with respect to barriers, so
memory safety problems (RCU bugs, missing barriers) may behave differently.

I had no luck with vastly overcommitting vCPUs either.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (199 preceding siblings ...)
  2018-12-03 14:20 ` bugzilla-daemon
@ 2018-12-03 14:57 ` bugzilla-daemon
  2018-12-03 15:10 ` bugzilla-daemon
                   ` (69 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 14:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #200 from Ortwin Glück (odi@odi.ch) ---
Oh wow, qemu/KVM does support multi-queue disks!
-drive file=${DISK},cache=writeback,id=d0,if=none
-device virtio-blk-pci,drive=d0,num-queues=4
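
A fuller invocation built around these two options might look like the dry-run
sketch below. Only the -drive and -device arguments come from this comment; the
-enable-kvm, -smp and -m 2048 settings and the helper name qemu_mq_cmd are
assumptions added for illustration.

```shell
# qemu_mq_cmd DISK [QUEUES]
# Print (not execute) a qemu command line that attaches DISK as a
# multi-queue virtio-blk device with QUEUES queues (default 4).
# qemu_mq_cmd is a hypothetical helper; -smp and -m values are assumptions.
qemu_mq_cmd() {
    local disk=$1 queues=${2:-4}
    printf '%s ' qemu-system-x86_64 -enable-kvm -smp "$queues" -m 2048 \
        -drive "file=${disk},cache=writeback,id=d0,if=none" \
        -device "virtio-blk-pci,drive=d0,num-queues=${queues}"
    printf '\n'
}
```

Inside the guest, the number of entries under /sys/block/vda/mq/ should then
match the requested queue count, assuming the disk shows up as vda.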


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (200 preceding siblings ...)
  2018-12-03 14:57 ` bugzilla-daemon
@ 2018-12-03 15:10 ` bugzilla-daemon
  2018-12-03 15:25 ` bugzilla-daemon
                   ` (68 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 15:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #201 from Eric Benoit (eric@ecks.ca) ---
Is it remotely possible this has to do with SPECTRE mitigation updates?

I'm assuming everyone has these enabled. Anyone with this issue that doesn't?
Have we seen it with AMD processors, or non-x86 even?

I should note the affected systems I've mentioned are Intel Core2 era. I
haven't been able to trigger it on an older AMD system, but that's using ext4.

I'll roll up my sleeves and put in some effort to sort this out later today.
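
For anyone who wants to report their mitigation state alongside their results,
the kernel exposes it through sysfs on the kernels discussed here. A minimal
sketch, with show_mitigations being a made-up helper name:

```shell
# show_mitigations [DIR]
# Print "name: status" for every entry in the kernel's CPU vulnerabilities
# directory. DIR defaults to the standard sysfs location; the parameter only
# exists so the function can be pointed at a copy of that directory.
show_mitigations() {
    local dir=${1:-/sys/devices/system/cpu/vulnerabilities} f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

Comparing this output between an affected and an unaffected machine would show
whether the mitigation settings differ at all.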


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (201 preceding siblings ...)
  2018-12-03 15:10 ` bugzilla-daemon
@ 2018-12-03 15:25 ` bugzilla-daemon
  2018-12-03 15:25 ` bugzilla-daemon
                   ` (67 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 15:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #202 from Guenter Roeck (linux@roeck-us.net) ---
#201: Various AMD and Intel CPUs are affected. Search for "AMD" and "Intel" in
this bug.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (202 preceding siblings ...)
  2018-12-03 15:25 ` bugzilla-daemon
@ 2018-12-03 15:25 ` bugzilla-daemon
  2018-12-03 16:20 ` bugzilla-daemon
                   ` (66 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 15:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #203 from Rainer Fiebig (jrf@mailbox.org) ---
#199
That it cannot be reproduced in VMs may still offer a clue, namely that it may
indeed have something to do with hardware and/or configuration.

In the end there has to be a discriminating factor between those systems that
have the problem and those that don't.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (203 preceding siblings ...)
  2018-12-03 15:25 ` bugzilla-daemon
@ 2018-12-03 16:20 ` bugzilla-daemon
  2018-12-03 17:01 ` bugzilla-daemon
                   ` (65 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 16:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #204 from Guenter Roeck (linux@roeck-us.net) ---
Another update:

Still working on a reliable reproducer. I have been able to reproduce the
problem again with v4.19.6 and the following patches reverted.

Revert "ext4: handle layout changes to pinned DAX mappings"
Revert "dax: remove VM_MIXEDMAP for fsdax and device dax"
Revert "ext4: close race between direct IO and ext4_break_layouts()"

I used a modified version of #185, running on each drive on the affected
system, plus a kernel build running in parallel.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (204 preceding siblings ...)
  2018-12-03 16:20 ` bugzilla-daemon
@ 2018-12-03 17:01 ` bugzilla-daemon
  2018-12-03 18:05 ` bugzilla-daemon
                   ` (64 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 17:01 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #205 from Rainer Fiebig (jrf@mailbox.org) ---
#204

So, only 2 commits left in your creative drill-down-effort, right? I hope the
huge amount of time you have invested pays off and puts an end to this crisis.

If so, it will still be interesting to know: Why only some and not all?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (205 preceding siblings ...)
  2018-12-03 17:01 ` bugzilla-daemon
@ 2018-12-03 18:05 ` bugzilla-daemon
  2018-12-03 19:42 ` bugzilla-daemon
                   ` (63 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 18:05 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #206 from Jimmy.Jazz@gmx.net ---
fsck does i/o itself. It doesn't aggravate or trigger the issue. Also, I can't
believe it is just an hdd, sata or usb hardware problem.

Moreover, it affects other types of file systems, zfs or nfs for instance (what
about ext3 and ext2?). So it should involve the code they share.

Three ideas, hopefully not too stupid.

+ log journal
On my computers, vmlinuz is written on an ext2 /boot file system every kernel
upgrade. If I remember well ext2 never failed and the file system stayed clean
during the tests.
If ext2 is not affected, it could involve the journal code instead.

+ if that's not the log, then the cache.
In my case, rsync and rm are involved in the file system corruption. It could
be explained like this: rsync reads inodes and blocks to compare them before
any write. The kernel reports an inconsistency regardless of whether the
inode/block is read from or written to the cache. As expected, only the changes
are sent to the media. That would explain why some of the corruptions never
reach the media and why, after the next reboot, fsck declares the disk clean:
only read i/o had been done before the reboot.

+ what about synchronisation
As I mentioned in other posts, even if the issue still lurks, the patch
proposed here makes the issue less intrusive.

My tests were made with a vanilla kernel source from gentoo portage
sys-kernel/vanilla-sources

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (206 preceding siblings ...)
  2018-12-03 18:05 ` bugzilla-daemon
@ 2018-12-03 19:42 ` bugzilla-daemon
  2018-12-03 19:56 ` bugzilla-daemon
                   ` (62 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 19:42 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #207 from Guenter Roeck (linux@roeck-us.net) ---
#205: Not really. Still working on the script - I'll publish it once it is even
more brutal - but I have now been able to reproduce the problem even with all
patches from #99 reverted.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (207 preceding siblings ...)
  2018-12-03 19:42 ` bugzilla-daemon
@ 2018-12-03 19:56 ` bugzilla-daemon
  2018-12-03 20:38 ` bugzilla-daemon
                   ` (61 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 19:56 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #208 from Guenter Roeck (linux@roeck-us.net) ---
Created attachment 279827
  --> https://bugzilla.kernel.org/attachment.cgi?id=279827&action=edit
Reproducer

To reproduce, run the attached script on each mounted file system. Also, run a
linux kernel build with as much parallelism as you dare. On top of that, run a
backup program such as borg. I don't know if this is all needed, but with all
that I am able to reproduce the problem quite reliably, for the most part
within a few minutes.

Typical log:

[  357.330900] EXT4-fs error (device sda1): ext4_iget:4795: inode #5519385:
comm borg: bad extra_isize 4752 (inode size 256)
[  357.351658] Aborting journal on device sda1-8.
[  357.355728] EXT4-fs error (device sda1) in ext4_reserve_inode_write:5805:
Journal has aborted
[  357.355745] EXT4-fs error (device sda1) in ext4_reserve_inode_write:5805:
Journal has aborted
[  357.365397] EXT4-fs (sda1): Remounting filesystem read-only
[  357.365942] EXT4-fs error (device sda1): ext4_iget:4795: inode #5519388:
comm borg: bad extra_isize 2128 (inode size 256)
[  357.366167] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[  357.371296] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[  357.375832] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[  357.382480] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[  357.382486] EXT4-fs (sda1): ext4_writepages: jbd2_start: 5114 pages, ino
5273647; err -30
[  357.384839] EXT4-fs error (device sda1): ext4_lookup:1578: inode #5513104:
comm borg: deleted inode referenced: 5519390
[  357.387331] EXT4-fs error (device sda1): ext4_iget:4795: inode #5519392:
comm borg: bad extra_isize 3 (inode size 256)
[  357.396557] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[  357.428824] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[  357.437008] EXT4-fs error (device sda1) in ext4_dirty_inode:5989: Journal
has aborted
[  357.441953] EXT4-fs error (device sda1) in ext4_dirty_inode:5989: Journal
has aborted

As you can see, it took just about six minutes after boot to see the problem.
Kernel version in this case is v4.19.6 with the five patches per #99 reverted.
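
To compare runs, a log like the one above can be condensed into one line per
device and reporting function. A minimal sketch, assuming the kernel log is
piped in as unwrapped text (for example from dmesg); the function name
ext4_error_sites is made up for illustration and is not part of the attached
reproducer:

```shell
# Hypothetical helper, not part of the attached reproducer.
# Reads kernel log text on stdin and prints, per device and reporting
# function, how many "EXT4-fs error" lines were seen (most frequent first).
ext4_error_sites() {
    sed -E -n 's/.*EXT4-fs error \(device ([^)]+)\)(:| in) ([A-Za-z0-9_]+):[0-9]+:.*/\1 \3/p' |
        sort | uniq -c | sort -rn
}

# Usage (reading the kernel log may need root on some systems):
#   dmesg | ext4_error_sites
```

Lines that are not "EXT4-fs error" reports, such as "Aborting journal" or
"Remounting filesystem read-only", are ignored by this filter.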

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (208 preceding siblings ...)
  2018-12-03 19:56 ` bugzilla-daemon
@ 2018-12-03 20:38 ` bugzilla-daemon
  2018-12-03 21:16 ` bugzilla-daemon
                   ` (60 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 20:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #209 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---
I am investigating the dates of the files and folders found by fsck.ext4 when
repairing the partition and I find something surprising.

# ls -l /lost+found
-rw-r--r-- 1 portage portage 5051 dic 10  2013 '#1057825'
drwxr-xr-x 3 portage portage 4096 dic 10  2013 '#1057827'
-rw-r--r-- 1 root    root    2022 oct 22 03:37 '#3184673'
-rw-r--r-- 1 root    root     634 oct 22 03:37 '#3184674'
-rw-r--r-- 1 root    root    1625 oct 22 03:37 '#3184675'

Many lost files date from October 22 at 03:37, all with the same day and time
and all belonging to the root user. Then there is a folder from December 10,
2013 belonging to user portage, group portage. For those who do not use Gentoo:
that user and that group only appear on the system under /usr/portage.

the contents of this folder:

# ls -l /lost+found/#1057827
drwxr-xr-x 11 portage portage 4096 dic 10  2013 vba

and inside that vba folder are many more folders and files, all from the same
ebuild of libreoffice from the end of 2013, probably from this version, when it
was updated.

$ genlop -t libreoffice
     Fri Nov  1 00:34:35 2013 >>> app-office/libreoffice-4.1.3.2
       merge time: 1 hour, 9 minutes and 14 seconds.

This package has not been on my PC for years. Its files were deleted when
libreoffice was upgraded, yet fsck now finds them when scanning, as if they
were still installed and had been corrupted. In fact they were erased a long
time ago, and now they turn up as broken and are put in lost+found.

So pay attention to the lost+found content of your partitions, to see whether
it holds current files or things you had long ago and had already deleted.
What I cannot explain is why fsck.ext4 starts to detect these deleted
fragments.
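
One way to check this is to list the lost+found entries ordered by
modification time. A rough sketch, assuming GNU find (for -printf); the
function name lost_found_by_date is only for illustration:

```shell
# Hypothetical helper: list the entries directly inside a lost+found
# directory, oldest first, as "YYYY-MM-DD owner path".
# Assumes GNU find, which provides -printf.
lost_found_by_date() {
    dir=${1:-/lost+found}
    find "$dir" -mindepth 1 -maxdepth 1 -printf '%TY-%Tm-%Td %u %p\n' | sort
}

# Usage (reading lost+found normally requires root):
#   lost_found_by_date /lost+found
```

Entries much older than the last package update are candidates for files that
had already been deleted.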

Could it be that the ext4 journal, or one of its unsynchronized copies,
remembers things that are no longer there and retrieves them from the freed
space?

My system and partitions were created on April 10, 2012 and I never had
corruption problems of this type.

$ genlop -t gentoo-sources |head -n3
     Wed Apr 11 23:39:02 2012 >>> sys-kernel/gentoo-sources-3.3.1

# tune2fs -l /dev/md2 |grep "Filesystem created:"
Filesystem created:       Tue Apr 10 16:18:28 2012

Regards

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (209 preceding siblings ...)
  2018-12-03 20:38 ` bugzilla-daemon
@ 2018-12-03 21:16 ` bugzilla-daemon
  2018-12-03 21:20 ` bugzilla-daemon
                   ` (59 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 21:16 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #210 from Rainer Fiebig (jrf@mailbox.org) ---
#207
So the problem seems more generic. Can you reproduce it now also on those
systems where you have *not* seen it yet?

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (210 preceding siblings ...)
  2018-12-03 21:16 ` bugzilla-daemon
@ 2018-12-03 21:20 ` bugzilla-daemon
  2018-12-03 22:19 ` bugzilla-daemon
                   ` (58 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 21:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #211 from Michel Roelofs (michel@michelroelofs.nl) ---
When I ran updatedb on 4.19.6, RETPOLINE disabled, I triggered within 2 minutes
the following errors (which I never saw with 4.14.x and older):
[  117.180537] BTRFS error (device dm-8): bad tree block start, want 614367232
have 23591879
[  117.222142] BTRFS info (device dm-8): read error corrected: ino 0 off
614367232 (dev /dev/mapper/linux-lxc sector 1216320)

And ~20 minutes later (while again running updatedb and compiling the kernel):
[ 1328.804705] EXT4-fs error (device dm-1): ext4_iget:4851: inode #7606807:
comm updatedb: checksum invalid

With debugfs I located the file of that inode, then I did an ls on it:
root@ster:# ls -l /home//michel/src/linux/linux/drivers/firmware/efi/test/
ls: cannot access '/home//michel/src/linux/linux/drivers/firmware/efi/test/':
Bad message
(reproduces)

Dropping dentry and inode cache (echo 2 > /proc/sys/vm/drop_caches) didn't
resolve this, but dropping all caches (echo 3 > /proc/sys/vm/drop_caches) did.

Both a simple 'ls' and 'debugfs -R "ncheck <inode>"' showed errors, which
were resolved by the 'echo 3 > /proc/sys/vm/drop_caches'. See my comment #168
for 4.19.5.
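
The check described above can be written down as a small helper. This is only
a sketch: the names probe_path, drop_all_caches and DROP_FN are made up for
illustration, and dropping caches needs root.

```shell
# Hypothetical helper: classify a path as "ok", "cache-only" (the error goes
# away after dropping caches) or "persistent" (the error survives the drop).
# DROP_FN names the command used to drop caches, so it can be replaced.
drop_all_caches() {
    # 3 = pagecache plus dentries and inodes; needs root.
    echo 3 > /proc/sys/vm/drop_caches
}

probe_path() {
    if ls -l "$1" >/dev/null 2>&1; then
        echo ok
        return
    fi
    "${DROP_FN:-drop_all_caches}"
    if ls -l "$1" >/dev/null 2>&1; then
        echo cache-only
    else
        echo persistent
    fi
}

# Usage (as root):
#   probe_path /path/to/suspect/directory/
```

A "cache-only" result would match what is reported here: the bad data is in
memory, not (yet) on the media.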

My next step is to try without SMP. Does anybody have suggestions what else I
can try, or where I should look? What information to share?

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (211 preceding siblings ...)
  2018-12-03 21:20 ` bugzilla-daemon
@ 2018-12-03 22:19 ` bugzilla-daemon
  2018-12-04  1:41 ` bugzilla-daemon
                   ` (57 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-03 22:19 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

A Z (alzaagman@hotmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alzaagman@hotmail.com

--- Comment #212 from A Z (alzaagman@hotmail.com) ---
I ran into this bug when using 4.19-4.19.5 and compiling Overwatch shaders.

Am no longer running into it with 4.19.6 after mounting my ext4 partition as
ext2.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (212 preceding siblings ...)
  2018-12-03 22:19 ` bugzilla-daemon
@ 2018-12-04  1:41 ` bugzilla-daemon
  2018-12-04  3:35 ` bugzilla-daemon
                   ` (56 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  1:41 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #213 from Guenter Roeck (linux@roeck-us.net) ---
#210: I'd rather refrain from it. Messing up one of my systems is bad enough.

Note that the problem is still sporadic; it may happen a minute into the test
(or even during boot), or after an hour.

I am now at a point where I still see the problem with almost all patches since
4.18.20 reverted; the only patch not reverted is the STAT_WRITE patch, because
it is difficult to revert due to context changes. I'll revert that manually for
the next round of tests. Here is the most recent log:

[ 2228.782567] EXT4-fs error (device sda1): ext4_iget:4795: inode #6317073:
comm borg: bad extra_isize 30840 (inode size 256)
[ 2228.805645] Aborting journal on device sda1-8.
[ 2228.814576] EXT4-fs (sda1): Remounting filesystem read-only
[ 2228.815816] EXT4-fs error (device sda1): ext4_iget:4795: inode #6317074:
comm borg: bad extra_isize 30840 (inode size 256)
[ 2228.817360] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[ 2228.817367] EXT4-fs (sda1): ext4_writepages: jbd2_start: 4086 pages, ino
5310565; err -30
[ 2228.819221] EXT4-fs error (device sda1): ext4_journal_check_start:61:
Detected aborted journal
[ 2228.819227] EXT4-fs (sda1): ext4_writepages: jbd2_start: 9223372036854775745
pages, ino 5328193; err -30

... and so on.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (213 preceding siblings ...)
  2018-12-04  1:41 ` bugzilla-daemon
@ 2018-12-04  3:35 ` bugzilla-daemon
  2018-12-04  3:39 ` bugzilla-daemon
                   ` (55 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  3:35 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #214 from Hao Wei Tee (angelsl@in04.sg) ---
(In reply to Guenter Roeck from comment #213)
> I am now at a point where I still see the problem with almost all patches
> since 4.18.20 reverted; the only patch not reverted is the STAT_WRITE patch,
> because it is difficult to revert due to context changes. I'll revert that
> manually for the next round of tests.

Seems more and more likely that it's not a bug in ext4.. except perhaps some
changes in ext4 make it easier to run into the bug.

Do you think you'll be able to reproduce it with the 4.18 ext4?

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (214 preceding siblings ...)
  2018-12-04  3:35 ` bugzilla-daemon
@ 2018-12-04  3:39 ` bugzilla-daemon
  2018-12-04  6:04 ` bugzilla-daemon
                   ` (54 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  3:39 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #215 from Guenter Roeck (linux@roeck-us.net) ---
#214: Not yet. Still trying.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (215 preceding siblings ...)
  2018-12-04  3:39 ` bugzilla-daemon
@ 2018-12-04  6:04 ` bugzilla-daemon
  2018-12-04  7:06 ` bugzilla-daemon
                   ` (53 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  6:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #216 from Jukka Santala (donwulff@nic.fi) ---
This time the corrupt inode block clearly came from one of the JPG files I was
checksumming (without writing) with rsync at the same time. It was a poor test
because I was checking against a BTRFS filesystem, so I don't know which fs the
corrupt block came from. It was also the first time I hit actual corruption
with the filesystem mounted errors=remount-ro; somehow two blocks of inodes had
multiply claimed inodes. To me this suggests that the corrupting block came
from another reservation block, and the kernel didn't notice because the data
structure was valid, and wrote it back. If so, this would indicate that it
happens inside a single filesystem and with metadata blocks as the source as
well.

It seems to me like metadata blocks are remaining linked when evicted due to
memory pressure. The BTRFS csum errors are probably from the same source. Steps
for reproducing would be causing evictions in a large pagecache while
re-accessing the same inode blocks. Backup scripts do this when the same block
contains inodes created at different times; i.e. for me it happens constantly
when reading files in date-specific directories where files from different days
are in the same inode block, so the copy command re-reads the same block after
some evictions. It is likely some race condition in block reservation or the
like, because otherwise it would be crashing all the time, but the corrupt
block stays in the cache.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (216 preceding siblings ...)
  2018-12-04  6:04 ` bugzilla-daemon
@ 2018-12-04  7:06 ` bugzilla-daemon
  2018-12-04  8:24 ` bugzilla-daemon
                   ` (52 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  7:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #217 from Rainer Fiebig (jrf@mailbox.org) ---
#213
>#210: I rather refrain from it. Messing up one of my systems is bad enough.
Absolutely. 

You're doing a great job here!

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (217 preceding siblings ...)
  2018-12-04  7:06 ` bugzilla-daemon
@ 2018-12-04  8:24 ` bugzilla-daemon
  2018-12-04  8:41 ` bugzilla-daemon
                   ` (51 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  8:24 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #218 from Guenter Roeck (linux@roeck-us.net) ---
Oh well. It took a long time, but:

v4.19.6, with fs/ext4 from v4.18.20:

[15903.283340] EXT4-fs error (device sdb1): ext4_lookup:1578: inode #5137538:
comm updatedb.mlocat: deleted inode referenced: 5273882
[15903.284896] Aborting journal on device sdb1-8.
[15903.286404] EXT4-fs (sdb1): Remounting filesystem read-only

I guess the next step will be to test v4.18.20 with my test script, to make
sure that this is not a long-time lingering problem. Other than that, I am open
to ideas.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (218 preceding siblings ...)
  2018-12-04  8:24 ` bugzilla-daemon
@ 2018-12-04  8:41 ` bugzilla-daemon
  2018-12-04  8:44 ` bugzilla-daemon
                   ` (50 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  8:41 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #219 from Ortwin Glück (odi@odi.ch) ---
#211
> Dropping dentry and inode cache (echo 2 > /proc/sys/vm/drop_caches) didn't
> resolve this, but dropping all caches (echo 3 > /proc/sys/vm/drop_caches)
> did.

So we have pagecache corruption. Sounds like a problem in vm code then. Is
anybody seeing the problem when there is no swap?

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (219 preceding siblings ...)
  2018-12-04  8:41 ` bugzilla-daemon
@ 2018-12-04  8:44 ` bugzilla-daemon
  2018-12-04 10:16 ` bugzilla-daemon
                   ` (49 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04  8:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #220 from carlphilippreh@gmail.com ---
(In reply to Ortwin Glück from comment #219)
> #211
> > Dropping dentry and inode cache (echo 2 > /proc/sys/vm/drop_caches) didn't
> > resolve this, but dropping all caches (echo 3 > /proc/sys/vm/drop_caches)
> > did.
> 
> So we have pagecache corruption. Sounds like a problem in vm code then. Is
> anybody seeing the problem when there is no swap?

I'm seeing this problem with and without swap. Two of the affected computers
even have CONFIG_SWAP=n.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (220 preceding siblings ...)
  2018-12-04  8:44 ` bugzilla-daemon
@ 2018-12-04 10:16 ` bugzilla-daemon
  2018-12-04 11:46 ` bugzilla-daemon
                   ` (48 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 10:16 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #221 from Rainer Fiebig (jrf@mailbox.org) ---
#218
If 4.18.20 turns out to be OK, my idea would be to bisect between 4.18 and
4.19.
Jimmy.Jazz has already done that and the result pointed to RCU. But IIRC it was
not a clear-cut

> git bisect bad
xyz123 is the first bad commit

With your script we now have a tool to reproduce the problem which makes the
distinction between "good" and "bad" more reliable. And everybody is now also
aware how important it is to ensure that the fs is OK after a bad kernel has
run and that the next step should be done with a known-good kernel. So it
should be possible to identify a bad commit.

Perhaps one could limit the bisect to kernel/rcu or block in a first step. And
if that's inconclusive, extend the search.
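
As a sketch of what such a limited bisect could look like: git bisect accepts
paths after "--" and then only considers commits touching them. The wrapper
name start_bisect and the RUN variable are made up for illustration; the tags
and paths are the ones mentioned above.

```shell
# Hypothetical wrapper around a path-limited bisect. Set RUN=echo to only
# print the git command instead of executing it.
start_bisect() {
    good=$1
    bad=$2
    shift 2
    # Remaining arguments are the paths the bisect is limited to.
    ${RUN:-} git bisect start "$bad" "$good" -- "$@"
}

# Usage, inside a kernel git tree:
#   start_bisect v4.18 v4.19 kernel/rcu block
# then mark each tested kernel with "git bisect good" or "git bisect bad".
```

If the limited run is inconclusive, the same call without paths covers all
commits between the two tags.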

But if 4.18.20 is bad, I have no clue at all - at least at the moment.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (221 preceding siblings ...)
  2018-12-04 10:16 ` bugzilla-daemon
@ 2018-12-04 11:46 ` bugzilla-daemon
  2018-12-04 15:03 ` bugzilla-daemon
                   ` (47 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 11:46 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

nclauzel@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nclauzel@gmail.com

--- Comment #222 from nclauzel@gmail.com ---
Hello,

Thank you all for your great work in this investigation.

Just my 2 cents: as mentioned earlier by others, I think it is closely related
to rsync. Or at least it is a good way to reproduce.

On my machine I had the issue very often when my rsync script was activated at
login. Since I deactivated this task it looks fine so far. 
I could easily see some issues in my rsync log file, even a text editor was
reporting issues on this file.

Hope this helps to find a fix quicker!

Let me know if you need more information from my side.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (222 preceding siblings ...)
  2018-12-04 11:46 ` bugzilla-daemon
@ 2018-12-04 15:03 ` bugzilla-daemon
  2018-12-04 17:52 ` bugzilla-daemon
                   ` (46 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 15:03 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #223 from Guenter Roeck (linux@roeck-us.net) ---
4.18.20 seems to be ok, except that my script overburdens it a bit.

[ 1088.450369] INFO: task systemd-tmpfile:31954 blocked for more than 120
seconds.
[ 1088.450374]       Not tainted 4.18.20+ #1
[ 1088.450375] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 1088.450377] systemd-tmpfile D    0 31954      1 0x00000000
[ 1088.450380] Call Trace:
[ 1088.450389]  __schedule+0x3f1/0x8c0
[ 1088.450392]  ? bit_wait+0x60/0x60
[ 1088.450394]  schedule+0x36/0x80
[ 1088.450398]  io_schedule+0x16/0x40
[ 1088.450400]  bit_wait_io+0x11/0x60
[ 1088.450403]  __wait_on_bit+0x63/0x90
[ 1088.450405]  out_of_line_wait_on_bit+0x8e/0xb0
[ 1088.450408]  ? init_wait_var_entry+0x50/0x50
[ 1088.450411]  __wait_on_buffer+0x32/0x40
[ 1088.450414]  __ext4_get_inode_loc+0x19f/0x3e0
[ 1088.450416]  ext4_iget+0x8f/0xc30

There is a key difference, though: With 4.18.20, cfq is active.

$ cat /sys/block/sd*/queue/scheduler
noop deadline [cfq] 
noop deadline [cfq] 

In 4.19, cfq is not available due to commit d5038a13eca72 ("scsi: core: switch
to scsi-mq by default"). I'll repeat my tests with SCSI_MQ_DEFAULT disabled on
v4.19, and with it enabled on v4.18.20. We know that disabling SCSI_MQ_DEFAULT
alone does not help, but maybe there is more than one problem.
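
For anyone comparing setups, the active scheduler is the name in square
brackets in the sysfs file shown above. A small sketch to extract it; the
function name active_scheduler is made up for illustration:

```shell
# Hypothetical helper: print the active scheduler from a sysfs "scheduler"
# line read on stdin, i.e. the name shown in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage: show the active scheduler for every sd* disk.
for f in /sys/block/sd*/queue/scheduler; do
    [ -r "$f" ] || continue      # no match, or file not readable
    printf '%s: %s\n' "$f" "$(active_scheduler < "$f")"
done
```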

Any takers for a round of bisects as suggested in #221 ?

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (223 preceding siblings ...)
  2018-12-04 15:03 ` bugzilla-daemon
@ 2018-12-04 17:52 ` bugzilla-daemon
  2018-12-04 18:04 ` bugzilla-daemon
                   ` (45 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 17:52 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #224 from Manfred (Manfred.Knick@T-Online.de) ---
(In reply to Guenter Roeck from comment #223)

> There is a key difference, though: With 4.18.20, cfq is active.
> 
> $ cat /sys/block/sd*/queue/scheduler
> noop deadline [cfq] 
> noop deadline [cfq] 
> 
> In 4.19, cfq is not available    <--- ?

# grep  -i cfq    config-4.19.3-gentoo

CONFIG_IOSCHED_CFQ=y                       < --- !
# CONFIG_CFQ_GROUP_IOSCHED is not set
CONFIG_DEFAULT_CFQ=y
CONFIG_DEFAULT_IOSCHED="cfq"

did not run me into trouble.

Being a production machine, reverted it to

### INFO:

# uname -a
Linux XXX 4.18.20-gentoo #1 SMP Wed Nov 28 12:30:28 CET 2018 
x86_64 Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz GenuineIntel GNU/Linux

Running Gentoo "stable" (with *very* few exceptions)

# equery list gcc
[IP-] [  ] sys-devel/gcc-7.3.0-r3:7.3.0

Exploiting disks directly attached @ ASUS Workstation MoBo P9D-WS :
   Samsung SSD,
   multiple S-ATA HDD from 5000 GB up to 6 TB, ...
as well as e.g. Adaptec SCSI Raid-1, 2 x WD 500

running stable till this evening.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (224 preceding siblings ...)
  2018-12-04 17:52 ` bugzilla-daemon
@ 2018-12-04 18:04 ` bugzilla-daemon
  2018-12-04 18:09 ` bugzilla-daemon
                   ` (44 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #225 from Steven Noonan (steven@uplinklabs.net) ---
(In reply to Manfred from comment #224)
> (In reply to Guenter Roeck from comment #223)
> 
> > There is a key difference, though: With 4.18.20, cfq is active.
> > 
> > $ cat /sys/block/sd*/queue/scheduler
> > noop deadline [cfq] 
> > noop deadline [cfq] 
> > 
> > In 4.19, cfq is not available    <--- ?
> 
> # grep  -i cfq    config-4.19.3-gentoo
> 
> CONFIG_IOSCHED_CFQ=y                       < --- !
> # CONFIG_CFQ_GROUP_IOSCHED is not set
> CONFIG_DEFAULT_CFQ=y
> CONFIG_DEFAULT_IOSCHED="cfq"
> 

When scsi_mod.use_blk_mq=1 (i.e. result of CONFIG_SCSI_MQ_DEFAULT=y), the I/O
scheduler is just "none", and you cannot set a different scheduler.

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (225 preceding siblings ...)
  2018-12-04 18:04 ` bugzilla-daemon
@ 2018-12-04 18:09 ` bugzilla-daemon
  2018-12-04 18:14 ` bugzilla-daemon
                   ` (43 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #226 from Manfred (Manfred.Knick@T-Online.de) ---
(In reply to Steven Noonan from comment #225)

Thanks for pointing this out  -  forgot to mention:

# grep CONFIG_SCSI_MQ_DEFAULT config-4.19.3-gentoo

# CONFIG_SCSI_MQ_DEFAULT is not set                     < ---

HTH

Respectfully

^ permalink raw reply	[flat|nested] 272+ messages in thread

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (226 preceding siblings ...)
  2018-12-04 18:09 ` bugzilla-daemon
@ 2018-12-04 18:14 ` bugzilla-daemon
  2018-12-04 18:22 ` bugzilla-daemon
                   ` (42 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:14 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #227 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Steven Noonan from comment #225)
> When scsi_mod.use_blk_mq=1 (i.e. result of CONFIG_SCSI_MQ_DEFAULT=y), the
> I/O scheduler is just "none", and you cannot set a different scheduler.

That's not true, you can set MQ capable schedulers. CFQ is from the legacy
stack, it doesn't support MQ. But you can set none/bfq/mq-deadline/kyber, for
instance.
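
For anyone who wants to check which scheduler is actually active, the
bracketed entry in /sys/block/<dev>/queue/scheduler is the one in use, as in
the "noop deadline [cfq]" output quoted earlier. A minimal Python sketch of
reading that format; the helper name parse_scheduler_line is made up for this
example and is not part of any kernel tooling:

```python
def parse_scheduler_line(line):
    """Split the contents of a queue/scheduler file into (active, available).

    The active scheduler is the token wrapped in square brackets,
    e.g. "noop deadline [cfq]" -> ("cfq", ["noop", "deadline", "cfq"]).
    If no token is bracketed, active is None.
    """
    active = None
    available = []
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets
            active = token
        available.append(token)
    return active, available
```

Feeding it the contents of each /sys/block/sd*/queue/scheduler file shows at a
glance whether a device is on cfq (legacy stack) or on one of the MQ capable
schedulers.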

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (227 preceding siblings ...)
  2018-12-04 18:14 ` bugzilla-daemon
@ 2018-12-04 18:22 ` bugzilla-daemon
  2018-12-04 18:29 ` bugzilla-daemon
                   ` (41 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:22 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #228 from Guenter Roeck (linux@roeck-us.net) ---
I guess I should have been more specific. With CONFIG_SCSI_MQ_DEFAULT=y (or
scsi_mod.use_blk_mq=1), cfq is not available. That applies to any kernel
version with CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1), not just to
4.19, and it doesn't apply to 4.19 if CONFIG_SCSI_MQ_DEFAULT=n (or
scsi_mod.use_blk_mq=0).

It is quite irrelevant if other schedulers are available if
CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1). cfq is not available, and
it doesn't matter if it is set as default or not.
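
To make the build-time side of this concrete, here is a small Python sketch
that reads the state of an option such as CONFIG_SCSI_MQ_DEFAULT from kernel
config text. It only covers the two line shapes shown in this thread
("OPTION=value" and "# OPTION is not set"); the function name is hypothetical,
and a scsi_mod.use_blk_mq= setting on the kernel command line is not visible
in the config file at all:

```python
def config_option_state(config_text, option):
    """Return the value assigned to `option`, or None if it is not set.

    "CONFIG_FOO=y"             -> "y"
    "# CONFIG_FOO is not set"  -> None
    option missing entirely    -> None
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == f"# {option} is not set":
            return None
        # Require the "=" so CONFIG_FOO does not match CONFIG_FOO_BAR.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None
```

With the config lines posted above, CONFIG_IOSCHED_CFQ comes back as "y" and
CONFIG_SCSI_MQ_DEFAULT as None.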

I hope this is specific enough this time. My apologies if I missed some other
means to enable or disable blk_mq.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (228 preceding siblings ...)
  2018-12-04 18:22 ` bugzilla-daemon
@ 2018-12-04 18:29 ` bugzilla-daemon
  2018-12-04 18:33 ` bugzilla-daemon
                   ` (40 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:29 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #229 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Guenter Roeck from comment #228)
> I guess I should have been more specific. With CONFIG_SCSI_MQ_DEFAULT=y (or
> scsi_mod.use_blk_mq=1), cfq is not available. That applies to any kernel
> version with CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1), not just
> to 4.19, and it doesn't apply to 4.19 if CONFIG_SCSI_MQ_DEFAULT=n (or
> scsi_mod.use_blk_mq=0).
> 
> It is quite irrelevant if other schedulers are available if
> CONFIG_SCSI_MQ_DEFAULT=y (or scsi_mod.use_blk_mq=1). cfq is not available,
> and it doesn't matter if it is set as default or not.
> 
> I hope this is specific enough this time. My apologies if I missed some
> other means to enable or disable blk_mq.

My clarification was for Steven, not you.

In terms of scheduler, CFQ will change the patterns a lot. For the non-mq case,
I'd recommend using noop or deadline for testing, otherwise I fear we're
testing a lot more than mq vs non-mq.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (229 preceding siblings ...)
  2018-12-04 18:29 ` bugzilla-daemon
@ 2018-12-04 18:33 ` bugzilla-daemon
  2018-12-04 18:34 ` bugzilla-daemon
                   ` (39 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:33 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #230 from Jens Axboe (axboe@kernel.dk) ---
Guenter, can you attach the .config you are running with?

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (230 preceding siblings ...)
  2018-12-04 18:33 ` bugzilla-daemon
@ 2018-12-04 18:34 ` bugzilla-daemon
  2018-12-04 18:37 ` bugzilla-daemon
                   ` (38 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:34 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #231 from Steven Noonan (steven@uplinklabs.net) ---
(In reply to Jens Axboe from comment #227)
> (In reply to Steven Noonan from comment #225)
> > When scsi_mod.use_blk_mq=1 (i.e. result of CONFIG_SCSI_MQ_DEFAULT=y), the
> > I/O scheduler is just "none", and you cannot set a different scheduler.
> 
> That's not true, you can set MQ capable schedulers. CFQ is from the legacy
> stack, it doesn't support MQ. But you can set none/bfq/mq-deadline/kyber,
> for instance.

My bad. I was basing my response on outdated information:

https://mahmoudhatem.wordpress.com/2016/02/08/oracle-uek-4-where-is-my-io-scheduler-none-multi-queue-model-blk-mq/

(Also didn't want to risk turning on MQ on one of my machines just to word my
response, especially if not having CFQ is somehow involved in this corruption
bug!)

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (231 preceding siblings ...)
  2018-12-04 18:34 ` bugzilla-daemon
@ 2018-12-04 18:37 ` bugzilla-daemon
  2018-12-04 18:37 ` bugzilla-daemon
                   ` (37 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Lukáš Krejčí (lskrejci@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lskrejci@gmail.com

--- Comment #232 from Lukáš Krejčí (lskrejci@gmail.com) ---
Created attachment 279845
  --> https://bugzilla.kernel.org/attachment.cgi?id=279845&action=edit
git bisect between v4.18 and 4.19-rc1

Hello,

I am able to reproduce the data corruption under QEMU. The issue usually shows
itself fairly quickly (within a minute or two). Generally, the bug was very
likely to appear when (un)installing packages with apt.

I ran a bisect with the following result (full bisect log is attached):
# first bad commit: [6ce3dd6eec114930cf2035a8bcb1e80477ed79a8] blk-mq: issue
directly if hw queue isn't busy in case of 'none'

You can revert the commit from linux v4.19 with: git revert --no-commit
8824f62246bef 6ce3dd6eec114 (did not try compiling and running the kernel
myself yet)

Obviously, this commit could just make the issue more prominent than it already
is, especially since some are saying that CONFIG_SCSI_MQ_DEFAULT=n does not
make the problem go away. The commit was added fairly early in the 4.19 merge
window, though, so if v4.18 is fine, it should be one of the 67 other commits
in that range.
The only thing I can think of is that the people that had blk-mq off in the
kernel config still had it enabled on the kernel command line
(scsi_mod.use_blk_mq=1, /sys/module/scsi_mod/parameters/use_blk_mq would then
be set to Y).

I am fairly certain of the bad commits in the bisect log because the corruption
was evident, less so of the good ones since I did only limited testing (about
3-6 VM restarts and a couple of minutes of running apt) and did not use the
reproducer script posted here.

There are a few preconditions that make the errors much more likely to appear:
- Ubuntu Desktop 18.10; Ubuntu Server 18.10 did not work (I guess there are a
few more things installed by default like Snap packages that are mounted on
startup, dpkg automatically searches for updates, etc.)
- as little RAM as possible (300 MB), 256 MB did not boot - this makes sure
swap is used (~200 MiB out of 472 MiB total)
- drive has to be the default if=ide; virtio-blk (-drive <...>,if=virtio) and
virtio-scsi (-drive file=<file>,media=disk,if=none,id=hd -device
virtio-scsi-pci,id=scsi -device scsi-hd,drive=hd) did not produce corruption (I
did not try setting num-queues, though)
- scsi_mod.use_blk_mq=1 has to be used, no errors for me without it (Ubuntu
mainline kernel 4.19.1 and later has this on by default)
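
The three drive attachments from the list above can be sketched as QEMU
argument lists. This is only an illustration in Python: the function name, the
image path and the explicit media=disk on the IDE and virtio-blk variants are
assumptions of this sketch, everything unrelated to memory and the disk (KVM,
display, networking) is left out, and the attached configuration description
remains the authoritative setup. scsi_mod.use_blk_mq=1 still has to be set on
the guest kernel command line separately.

```python
def qemu_args(image, attach="ide", mem_mb=300):
    """Build a qemu-system-x86_64 argument list for one drive attachment.

    attach is one of "ide" (the case that showed corruption),
    "virtio-blk" or "virtio-scsi" (neither showed corruption).
    """
    args = ["qemu-system-x86_64", "-m", str(mem_mb)]  # -m takes MiB by default
    if attach == "ide":
        args += ["-drive", f"file={image},media=disk,if=ide"]
    elif attach == "virtio-blk":
        args += ["-drive", f"file={image},media=disk,if=virtio"]
    elif attach == "virtio-scsi":
        args += [
            "-drive", f"file={image},media=disk,if=none,id=hd",
            "-device", "virtio-scsi-pci,id=scsi",
            "-device", "scsi-hd,drive=hd",
        ]
    else:
        raise ValueError(f"unknown attachment type: {attach}")
    return args
```

Passing the result to subprocess.run() starts the VM; keeping the three
variants in one place makes it easier to switch only the attachment type
between test runs.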

Before running the bisect, I tested these kernels (all Ubuntu mainline from
http://kernel.ubuntu.com/~kernel-ppa/mainline/):

Had FS corruption:
4.19-rc1
4.19
4.19.1
4.19.2
4.19.3
4.19.4
4.19.5
4.19.6

No corruption (yet):
4.18
4.18.20

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (232 preceding siblings ...)
  2018-12-04 18:37 ` bugzilla-daemon
@ 2018-12-04 18:37 ` bugzilla-daemon
  2018-12-04 18:47 ` bugzilla-daemon
                   ` (36 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #233 from Lukáš Krejčí (lskrejci@gmail.com) ---
Created attachment 279847
  --> https://bugzilla.kernel.org/attachment.cgi?id=279847&action=edit
description of my Qemu and Ubuntu configuration

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (233 preceding siblings ...)
  2018-12-04 18:37 ` bugzilla-daemon
@ 2018-12-04 18:47 ` bugzilla-daemon
  2018-12-04 18:48 ` bugzilla-daemon
                   ` (35 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #234 from Jens Axboe (axboe@kernel.dk) ---
That's awesome, that makes some sense, finally! There's a later fix for that
one that is also in 4.19, but I guess that doesn't fix every failure case.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (234 preceding siblings ...)
  2018-12-04 18:47 ` bugzilla-daemon
@ 2018-12-04 18:48 ` bugzilla-daemon
  2018-12-04 18:59 ` bugzilla-daemon
                   ` (34 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:48 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #235 from Jens Axboe (axboe@kernel.dk) ---
I'm going to run your qemu config and see if I can reproduce, then a real fix
should be imminent.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (235 preceding siblings ...)
  2018-12-04 18:48 ` bugzilla-daemon
@ 2018-12-04 18:59 ` bugzilla-daemon
  2018-12-04 19:25 ` bugzilla-daemon
                   ` (33 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 18:59 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #236 from Guenter Roeck (linux@roeck-us.net) ---
Excellent. Finally getting somewhere. FWIW, I am not able to reproduce the
problem (anymore) with v4.19.6 and SCSI_MQ_DEFAULT=n. At this point I am not
sure if my earlier test that saw it failing was a false positive. I'll try with
the two reverts suggested in #232 next.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (236 preceding siblings ...)
  2018-12-04 18:59 ` bugzilla-daemon
@ 2018-12-04 19:25 ` bugzilla-daemon
  2018-12-04 20:25 ` bugzilla-daemon
                   ` (32 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 19:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #237 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Lukáš Krejčí from comment #232)
> Created attachment 279845 [details]
> git bisect between v4.18 and 4.19-rc1
> 
> Hello,
> 
> I am able to reproduce the data corruption under Qemu, the issue usually
> shows itself fairly quickly (within a minute or two). Generally, the bug was
> very likely to appear when (un)installing packages with apt.
> 
> I ran a bisect with the following result (full bisect log is attached):
> # first bad commit: [6ce3dd6eec114930cf2035a8bcb1e80477ed79a8] blk-mq: issue
> directly if hw queue isn't busy in case of 'none'
> 
[...]

Congrats! Good to see progress here. Thanks!

I also feel somewhat vindicated as my idea to catch and bisect this in VM
wasn't so bad after all. ;)

But obviously qemu has more knobs to turn than VB - and you just turned the
right ones. Great!

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (237 preceding siblings ...)
  2018-12-04 19:25 ` bugzilla-daemon
@ 2018-12-04 20:25 ` bugzilla-daemon
  2018-12-04 20:36 ` bugzilla-daemon
                   ` (31 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 20:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #238 from Rainer Fiebig (jrf@mailbox.org) ---
#232
That you could see the errors so early and reliably really baffles me.

This very morning I *concurrently*
- ran Guenter's script for 30 minutes
- compiled a kernel
- did some file copying

with CONFIG_SCSI_MQ_DEFAULT=y and didn't see one error. 

But it is a Debian-8-VM, 1024 GB RAM and the two discs attached as SATA/SSD. I
guess you must have played around with the settings for a while - or did you
have the idea of limiting RAM and attaching the disc as IDE right from the
start? Anyway - great that you found this out!

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (238 preceding siblings ...)
  2018-12-04 20:25 ` bugzilla-daemon
@ 2018-12-04 20:36 ` bugzilla-daemon
  2018-12-04 20:59 ` bugzilla-daemon
                   ` (30 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 20:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #239 from Marc Burkhardt (marc@osknowledge.org) ---
Could anyone just sum up what needs to be set to trigger the bug (as of the
understanding we have now)?

I use

scsi_mod.use_blk_mq=y

and

dm_mod.use_blk_mq=y

for ages but I do not see the bug. I use the

mq-deadline

scheduler.

#232 somehow suggests it needs additional memory pressure to trigger it,
doesn't it?

Quite confused here...

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (239 preceding siblings ...)
  2018-12-04 20:36 ` bugzilla-daemon
@ 2018-12-04 20:59 ` bugzilla-daemon
  2018-12-04 21:01 ` bugzilla-daemon
                   ` (29 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 20:59 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #240 from Guenter Roeck (linux@roeck-us.net) ---
As mentioned earlier, I only ever saw the problem on two of four systems (see
#57), all running the same kernel and the same version of Ubuntu. The only
differences are mainboard, CPU, and attached drive types.

I don't think we know for sure what it takes to trigger the problem. We have
seen various guesses, from gcc version to l1tf mitigation to CPU type, broken
hard drives, and whatnot. At this time evidence points to the block subsystem,
with bisect pointing to a commit which relies on the state of the HW queue
(empty or not) in conjunction with the 'none' io scheduler. This may suggest
that drive speed and access timing may be involved. That guess may of course be
just as wrong as all the others.

Let's just hope that Jens will be able to track down and fix the problem. Then
we may be able to get a better idea what it actually takes to trigger it.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (240 preceding siblings ...)
  2018-12-04 20:59 ` bugzilla-daemon
@ 2018-12-04 21:01 ` bugzilla-daemon
  2018-12-04 21:11 ` bugzilla-daemon
                   ` (28 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 21:01 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #241 from Guenter Roeck (linux@roeck-us.net) ---
Oh, and if commit 6ce3dd6ee is indeed the culprit, you won't be able to trigger
the problem with mq-deadline (or any other scheduler) active.
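
Put as a quick check, and only under the working assumption that 6ce3dd6ee is
the culprit: a SCSI device would match the reported conditions when blk-mq is
enabled and the active scheduler is 'none'. A rough Python sketch, taking the
contents of /sys/module/scsi_mod/parameters/use_blk_mq and of the device's
queue/scheduler file as strings; the function name is made up, and a False
result is not proof that a system is safe:

```python
def matches_reported_conditions(use_blk_mq_param, scheduler_line):
    """True if blk-mq is on for SCSI and the active scheduler is 'none'.

    use_blk_mq_param: contents of the use_blk_mq parameter file ("Y" or "N").
    scheduler_line:   contents of queue/scheduler, active entry in brackets.
    """
    mq_on = use_blk_mq_param.strip() in ("Y", "y", "1")
    # Collect the bracketed (active) scheduler name(s).
    active = [
        token[1:-1]
        for token in scheduler_line.split()
        if token.startswith("[") and token.endswith("]")
    ]
    return mq_on and active == ["none"]
```

A setup running mq-deadline, for instance, comes out False here, in line with
the remark above.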

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (241 preceding siblings ...)
  2018-12-04 21:01 ` bugzilla-daemon
@ 2018-12-04 21:11 ` bugzilla-daemon
  2018-12-04 21:17 ` bugzilla-daemon
                   ` (27 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 21:11 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #242 from Jens Axboe (axboe@kernel.dk) ---
Progress report - I've managed to reproduce it now, following the procedure
from Lukáš Krejčí.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (242 preceding siblings ...)
  2018-12-04 21:11 ` bugzilla-daemon
@ 2018-12-04 21:17 ` bugzilla-daemon
  2018-12-04 21:37 ` bugzilla-daemon
                   ` (26 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 21:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #243 from Han Vinke (jfavinke@gmail.com) ---
I have a multi-boot PC: Windows 10, Arch Linux with kernel 4.19.x, and Ubuntu
Disco Dingo with 4.19.x. My Arch Linux install is on an encrypted LVM.

I can actually trigger the EXT4-fs errors on Ubuntu, which is not encrypted but
has cryptsetup-initramfs installed because I make regular partclone backups of
the Arch partitions.

All that is needed on Ubuntu is to run sudo update-initramfs -u

cryptsetup: WARNING: The initramfs image may not contain cryptsetup binaries
nor crypto modules. If that's on purpose, you may want to uninstall the
'cryptsetup-initramfs' package in order to disable the cryptsetup initramfs
integration and avoid this warning.

You will get a warning or error that is also described in this bug report:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=901830

The problem is that when you reboot you will get all the EXT4-fs errors.
I have to run e2fsck from Arch; it reports some inode errors, and after
rebooting into Ubuntu the problem is gone, as if there never were any problems.

Also, when I am on Arch cloning the Ubuntu partitions I can reproduce these
errors, for instance when mistakenly partcloning a read-only mounted partition.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (243 preceding siblings ...)
  2018-12-04 21:17 ` bugzilla-daemon
@ 2018-12-04 21:37 ` bugzilla-daemon
  2018-12-04 21:43 ` bugzilla-daemon
                   ` (25 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 21:37 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #244 from Guenter Roeck (linux@roeck-us.net) ---
So far I have not been able to reproduce the problem on the affected systems
after reverting commits 8824f62246 and 6ce3dd6eec.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (244 preceding siblings ...)
  2018-12-04 21:37 ` bugzilla-daemon
@ 2018-12-04 21:43 ` bugzilla-daemon
  2018-12-04 21:50 ` bugzilla-daemon
                   ` (24 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 21:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #245 from Jens Axboe (axboe@kernel.dk) ---
I've only reproduced it that one time, but here's what I think is happening:

- Usually a request is queued, inserted into the blk-mq proper

- There's an optimization in place to attempt to issue to the driver directly
before doing that insert. If we fail because of some resource limitation, we
insert the request into blk-mq proper

- But if that failure did trigger, SCSI has already set up the command. This
means we now have a request in the regular blk-mq IO lists that is mergeable
with other commands, but where the SG tables for IO have already been set up.

- If we later do merge with this IO before dispatch, we'll only do DMA to the
original part of the request. This makes the rest very unhappy...

The case is different because from normal dispatch, if IO needs to be requeued,
it will NEVER be merged/changed after the fact. This means that we don't have
to release SG tables/mappings, we can simply reissue later.
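
The sequence described above can be modelled with a toy Python sketch. This is
not kernel code: Request, prep, back_merge and dispatch are invented names, and
the "SG table" is reduced to a count of how many sectors were mapped when the
driver prepared the command. It only illustrates why a merge after driver-side
setup leaves part of the request without DMA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Toy stand-in for a block request (hypothetical, not struct request)."""
    sectors: list                      # data the filesystem wants transferred
    sg_sectors: Optional[int] = None   # sectors covered by the prepared SG table


def prep(req):
    """Driver-side setup: size the SG table for the request as it is now."""
    if req.sg_sectors is None:         # an already prepared request is not re-prepped
        req.sg_sectors = len(req.sectors)


def back_merge(req, more):
    """Append more sectors to a request sitting in the normal IO lists."""
    req.sectors.extend(more)


def dispatch(req):
    """Transfer only what the SG table covers; return (transferred, missed)."""
    prep(req)
    return req.sectors[:req.sg_sectors], req.sectors[req.sg_sectors:]


def run(direct_issue_fails):
    req = Request(sectors=["A", "B"])
    if direct_issue_fails:
        prep(req)                      # direct issue got as far as driver prep
    back_merge(req, ["C", "D"])        # request is merged while queued
    return dispatch(req)
```

In this model run(False) transfers all four sectors, while run(True) transfers
only "A" and "B" and leaves "C" and "D" untouched, which corresponds to the
"original part of the request" case in the theory.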

This is just a theory... If I could reproduce more reliably, I'd verify it. I'm
going to spin a quick patch.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (245 preceding siblings ...)
  2018-12-04 21:43 ` bugzilla-daemon
@ 2018-12-04 21:50 ` bugzilla-daemon
  2018-12-04 21:59 ` bugzilla-daemon
                   ` (23 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 21:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #246 from Michel Roelofs (michel@michelroelofs.nl) ---
With 4.19.6, setting CONFIG_SCSI_MQ_DEFAULT=n seems to resolve the issue on my
system, going back to CONFIG_SCSI_MQ_DEFAULT=y makes it show up again. Indeed
all schedulers in /sys/devices/virtual/block/*/queue/scheduler are none.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (246 preceding siblings ...)
  2018-12-04 21:50 ` bugzilla-daemon
@ 2018-12-04 21:59 ` bugzilla-daemon
  2018-12-04 22:04 ` bugzilla-daemon
                   ` (22 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 21:59 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #247 from Guenter Roeck (linux@roeck-us.net) ---
#245: Is there a means to log the possible error case ?

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (247 preceding siblings ...)
  2018-12-04 21:59 ` bugzilla-daemon
@ 2018-12-04 22:04 ` bugzilla-daemon
  2018-12-04 22:06 ` bugzilla-daemon
                   ` (21 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Chris Severance (linuxkernel.severach@spamgourmet.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linuxkernel.severach@spamgo
                   |                            |urmet.com

--- Comment #248 from Chris Severance (linuxkernel.severach@spamgourmet.com) ---
Created attachment 279851
  --> https://bugzilla.kernel.org/attachment.cgi?id=279851&action=edit
Triggered by mdraid

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (248 preceding siblings ...)
  2018-12-04 22:04 ` bugzilla-daemon
@ 2018-12-04 22:06 ` bugzilla-daemon
  2018-12-04 22:08 ` bugzilla-daemon
                   ` (20 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #249 from Jens Axboe (axboe@kernel.dk) ---
Created attachment 279853
  --> https://bugzilla.kernel.org/attachment.cgi?id=279853&action=edit
4.19 fix

Here's a fix, verifying it now. It might be better to fully unprep the request
after the direct issue fails, but this one should be the safe no-brainer. And
at times like this, that feels prudent...

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (249 preceding siblings ...)
  2018-12-04 22:06 ` bugzilla-daemon
@ 2018-12-04 22:08 ` bugzilla-daemon
  2018-12-04 22:11 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Chris Severance (linuxkernel.severach@spamgourmet.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #279851|Triggered by mdraid         |dmesg with mdraid1
        description|                            |

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (250 preceding siblings ...)
  2018-12-04 22:08 ` bugzilla-daemon
@ 2018-12-04 22:11 ` bugzilla-daemon
  2018-12-04 22:12 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:11 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #250 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Chris Severance from comment #248)
> Created attachment 279851 [details]
> Triggered by mdraid

Unrelated issue.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (251 preceding siblings ...)
  2018-12-04 22:11 ` bugzilla-daemon
@ 2018-12-04 22:12 ` bugzilla-daemon
  2018-12-04 22:17 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:12 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #251 from Guenter Roeck (linux@roeck-us.net) ---
#248: Looks like you may be running my reproducer script. It tends to do that,
especially on slow drives (i.e. anything but NVMe), depending on the io scheduler
used. I have seen it with "cfq", but not with "none". iostat would probably
show you 90+ % iowait when it happens. That by itself does not indicate the
error we are trying to track down here.

* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (252 preceding siblings ...)
  2018-12-04 22:12 ` bugzilla-daemon
@ 2018-12-04 22:17 ` bugzilla-daemon
  2018-12-04 22:32 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #252 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Guenter Roeck from comment #247)
> #245: Is there a means to log the possible error case ?

Yes, that's how I ended up verifying that this was indeed what was going on.
Example:

[  235.665576] issue_direct=13, 22080, ffff9ee3da59e400
[  235.931483] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400
[  235.931486] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400
[  235.931489] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400
[  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm
dpkg-query: bad extra_isize 24937 (inode size 256)

Here we see req 0xffff9ee3da59e400 being rejected due to resource starvation.
Shortly thereafter, we see more IO being happily merged into that request. Once
that request finishes, ext4 immediately gets unhappy, as only a small part of
the request contents was valid. The rest simply contained garbage.
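
For anyone who wants to look for the same pattern in their own logs, here is a
minimal sketch in Python. It assumes the two ad-hoc debug lines quoted above
(issue_direct=... and "MERGE ON NO PREP"), which come from a local debug patch
and are not mainline kernel messages. The meaning of the numeric fields in the
issue_direct line is not stated in this thread, so the sketch only uses the
request pointer. The function name is made up for illustration.

```python
import re
from collections import defaultdict

# Patterns for the ad-hoc debug output quoted in this comment (assumption:
# your own debug patch prints lines in exactly this shape).
ISSUE_RE = re.compile(r"issue_direct=\d+, \d+, ([0-9a-f]+)")
MERGE_RE = re.compile(r"bio_attempt_back_merge: MERGE ON NO PREP ([0-9a-f]+)")


def merges_after_failed_issue(log_lines):
    """Count merges into each request that was logged by issue_direct earlier."""
    seen_issue = set()
    merges = defaultdict(int)
    for line in log_lines:
        issue = ISSUE_RE.search(line)
        if issue:
            seen_issue.add(issue.group(1))
            continue
        merge = MERGE_RE.search(line)
        if merge and merge.group(1) in seen_issue:
            merges[merge.group(1)] += 1
    return dict(merges)


SAMPLE = [
    "[  235.665576] issue_direct=13, 22080, ffff9ee3da59e400",
    "[  235.931483] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400",
    "[  235.931486] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400",
    "[  235.931489] bio_attempt_back_merge: MERGE ON NO PREP ffff9ee3da59e400",
]

if __name__ == "__main__":
    print(merges_after_failed_issue(SAMPLE))
```

Any request pointer that shows up in the result was merged into after its
direct issue was logged, which is the sequence described above.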


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (253 preceding siblings ...)
  2018-12-04 22:17 ` bugzilla-daemon
@ 2018-12-04 22:32 ` bugzilla-daemon
  2018-12-04 22:35 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Jens Axboe (axboe@kernel.dk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #279853|0                           |1
        is obsolete|                            |

--- Comment #253 from Jens Axboe (axboe@kernel.dk) ---
Created attachment 279855
  --> https://bugzilla.kernel.org/attachment.cgi?id=279855&action=edit
4.19 patch v2

Better version of the previous patch. They both solve the issue, but the latter
version seems safer since it doesn't rely on whatever state that SCSI happens
to maintain. If we fail direct dispatch, don't ever touch the request before
dispatch.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (254 preceding siblings ...)
  2018-12-04 22:32 ` bugzilla-daemon
@ 2018-12-04 22:35 ` bugzilla-daemon
  2018-12-04 22:50 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:35 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Conrad Kostecki (ck+kernelbugzilla@bl4ckb0x.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ck+kernelbugzilla@bl4ckb0x.
                   |                            |de

--- Comment #254 from Conrad Kostecki (ck+kernelbugzilla@bl4ckb0x.de) ---
(In reply to Michel Roelofs from comment #246)
> With 4.19.6, setting CONFIG_SCSI_MQ_DEFAULT=n seems to resolve the issue on
> my system, going back to CONFIG_SCSI_MQ_DEFAULT=y makes it show up again.
> Indeed all schedulers in /sys/devices/virtual/block/*/queue/scheduler are
> none.

I can confirm here.
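
In case it helps others check the same thing, here is a small Python sketch
that reports the active scheduler per block device. It assumes the usual sysfs
format where the active entry is shown in square brackets (for example
"mq-deadline kyber [none]"), and it treats a single unbracketed word as the
active entry. The function names and the default glob pattern are my own
choices for illustration.

```python
import glob
import os


def active_scheduler(text):
    """Return the active scheduler from the contents of a queue/scheduler file."""
    tokens = text.split()
    for token in tokens:
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Assumption: a lone word without brackets (e.g. "none") is the active entry.
    if len(tokens) == 1:
        return tokens[0]
    return None


def schedulers_by_device(pattern="/sys/block/*/queue/scheduler"):
    """Map each device name to its active scheduler (None if unreadable)."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as handle:
                result[device] = active_scheduler(handle.read())
        except OSError:
            result[device] = None
    return result


if __name__ == "__main__":
    for device, scheduler in schedulers_by_device().items():
        print(device, scheduler)
```

Pass "/sys/devices/virtual/block/*/queue/scheduler" as the pattern to look at
the virtual devices mentioned in the quoted comment.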


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (255 preceding siblings ...)
  2018-12-04 22:35 ` bugzilla-daemon
@ 2018-12-04 22:50 ` bugzilla-daemon
  2018-12-04 23:28 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 22:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Jens Axboe (axboe@kernel.dk) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #279855|0                           |1
        is obsolete|                            |

--- Comment #255 from Jens Axboe (axboe@kernel.dk) ---
Created attachment 279857
  --> https://bugzilla.kernel.org/attachment.cgi?id=279857&action=edit
4.19/4.20 patch v3

Here's the one I sent upstream; I also tested this one. It should be the safest
of them all, as we set REQ_NOMERGE at the source, when we attempt to queue the
request. That'll cover all cases, guaranteed.
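
To illustrate the idea for readers who do not follow the block layer, here is
a toy model in Python. It is not kernel code and does not mirror the patch.
ToyRequest, prepared_bios and the helper functions are invented for this
sketch. It only models what is described in this thread: a request is prepared
when direct issue is attempted, the device rejects it, and later merges grow
the request beyond what was prepared unless the request is marked as not
mergeable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    bios: List[str] = field(default_factory=list)
    prepared_bios: int = 0   # how much of the request the "driver" has set up
    nomerge: bool = False    # stand-in for REQ_NOMERGE


def try_direct_issue(req, device_busy, mark_nomerge):
    """Prepare the request and attempt dispatch; return True if dispatched."""
    if mark_nomerge:
        req.nomerge = True               # set when we attempt to queue it
    req.prepared_bios = len(req.bios)    # driver-side state captured at prep time
    return not device_busy


def try_back_merge(req, bio):
    """Append a bio unless the request is marked as not mergeable."""
    if req.nomerge:
        return False
    req.bios.append(bio)
    return True


def simulate(mark_nomerge):
    """Return the fraction of the final request that was actually prepared."""
    req = ToyRequest(bios=["bio0"])
    dispatched = try_direct_issue(req, device_busy=True, mark_nomerge=mark_nomerge)
    assert not dispatched                # the toy device is busy, so issue fails
    for name in ("bio1", "bio2", "bio3"):
        try_back_merge(req, name)
    return req.prepared_bios / len(req.bios)


if __name__ == "__main__":
    print("without nomerge:", simulate(False))
    print("with nomerge:   ", simulate(True))
```

Without the flag, only a quarter of the toy request matches what was prepared.
With the flag, the later merges are refused and the request stays consistent.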

The folks that have seen this, please try this patch on top of 4.19 or 4.20-rc
and report back.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (256 preceding siblings ...)
  2018-12-04 22:50 ` bugzilla-daemon
@ 2018-12-04 23:28 ` bugzilla-daemon
  2018-12-04 23:38 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 23:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #256 from Eric Benoit (eric@ecks.ca) ---
Over an hour now on 4.19.0 with commits 8824f62246 and 6ce3dd6eec reverted, and
ZFS is happy. Plenty of IO and not a single checksum error.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (257 preceding siblings ...)
  2018-12-04 23:28 ` bugzilla-daemon
@ 2018-12-04 23:38 ` bugzilla-daemon
  2018-12-04 23:50 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 23:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #257 from Jens Axboe (axboe@kernel.dk) ---
(In reply to Eric Benoit from comment #256)
> Over an hour now on 4.19.0 with commits 8824f62246 and 6ce3dd6eec reverted,
> and ZFS is happy. Plenty of IO and not a single checksum error.

Can you try 4.19.x with the patch from comment #255? Thanks!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (258 preceding siblings ...)
  2018-12-04 23:38 ` bugzilla-daemon
@ 2018-12-04 23:50 ` bugzilla-daemon
  2018-12-05  1:08 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-04 23:50 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #258 from Guenter Roeck (linux@roeck-us.net) ---
So far 30 minutes on two systems running the patch from #255. Prior to that,
close to two hours running patch v1. No issues so far.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (259 preceding siblings ...)
  2018-12-04 23:50 ` bugzilla-daemon
@ 2018-12-05  1:08 ` bugzilla-daemon
  2018-12-05  1:10 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  1:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Scott Ellis (scotte@warped.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |scotte@warped.com

--- Comment #259 from Scott Ellis (scotte@warped.com) ---
In my case (kernel 4.19.6 and a 2-vdev/6-drive raidz1) doing an rsync to/from
the same ZFS filesystem would generate ~1 error every 5s or so (on a random
drive on the pool).  With the patch from #255 I have been through 300GB of the
rsync w/o any errors.  Throughput of the rsync is identical before/after the
patch.

#255 feels like a good patch.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (260 preceding siblings ...)
  2018-12-05  1:08 ` bugzilla-daemon
@ 2018-12-05  1:10 ` bugzilla-daemon
  2018-12-05  1:40 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  1:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #260 from Eric Benoit (eric@ecks.ca) ---
About an hour now with 4.19.6 and the patch from #255 without a peep from zed.
I think we have a winner!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (261 preceding siblings ...)
  2018-12-05  1:10 ` bugzilla-daemon
@ 2018-12-05  1:40 ` bugzilla-daemon
  2018-12-05  6:31 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  1:40 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #261 from Guenter Roeck (linux@roeck-us.net) ---
Two hours. Agreed - looks like a winner.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (262 preceding siblings ...)
  2018-12-05  1:40 ` bugzilla-daemon
@ 2018-12-05  6:31 ` bugzilla-daemon
  2018-12-05  8:48 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  6:31 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #262 from Néstor A. Marchesini (nestorm_des@hotmail.com) ---
I have applied Jens Axboe's 4.19/4.20 patch v3 from #255 to my 4.19.0, 4.19.5
and 4.19.6 trees, and I can say that for the first time I am able to use
4.19.0.
Reading #259 reminds me that every time I updated I started having problems
with rsync and had to run it several times to get it right ... I innocently
thought it was a problem in the repository.
I had been checking my HDDs in case they had real physical problems, but:

# smartctl -H /dev/sda
SMART overall-health self-assessment test result: PASSED

# smartctl -H /dev/sdb
SMART overall-health self-assessment test result: PASSED

Very glad that the root of the trouble has been found ... let's see how it
goes from here.

$ uname -r
Linux pc-user 4.19.0-gentoo

Regards


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (263 preceding siblings ...)
  2018-12-05  6:31 ` bugzilla-daemon
@ 2018-12-05  8:48 ` bugzilla-daemon
  2018-12-05  9:06 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  8:48 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #263 from Rainer Fiebig (jrf@mailbox.org) ---
(In reply to Guenter Roeck from comment #240)
> As mentioned earlier, I only ever saw the problem on two of four systems
> (see #57), all running the same kernel and the same version of Ubuntu. The
> only differences are mainboard, CPU, and attached drive types.
> 
> I don't think we know for sure what it takes to trigger the problem. We have
> seen various guesses, from gcc version to l1tf mitigation to CPU type,
> broken hard drives, and whatnot. At this time evidence points to the block
> subsystem, with bisect pointing to a commit which relies on the state of the
> HW queue (empty or not) in conjunction with the 'none' io scheduler. This
> may suggest that drive speed and access timing may be involved. That guess
> may of course be just as wrong as all the others.
> 
> Let's just hope that Jens will be able to track down and fix the problem.
> Then we may be able to get a better idea what it actually takes to trigger
> it.

It would indeed be nice to get a short summary *here* of what happened and why,
once the dust has settled.

It would also be interesting to know why all the testing in the run-up to 4.19
didn't catch it, including rc-kernels. It's imo for instance unlikely that
everybody just tested with CONFIG_SCSI_MQ_DEFAULT=n.


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (264 preceding siblings ...)
  2018-12-05  8:48 ` bugzilla-daemon
@ 2018-12-05  9:06 ` bugzilla-daemon
  2018-12-05  9:33 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  9:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #264 from Rafał Miłecki (zajec5@gmail.com) ---
Is it possible to avoid this bug by using some command line parameter or
setting some sysfs entry? Something I could use on my machines before the fix
gets included in my distribution?


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (265 preceding siblings ...)
  2018-12-05  9:06 ` bugzilla-daemon
@ 2018-12-05  9:33 ` bugzilla-daemon
  2018-12-05  9:51 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  9:33 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #265 from Ortwin Glück (odi@odi.ch) ---
I think Jens pretty much summarized the situation in #245. To trigger the bug
blk-mq must be used together with an underlying block device (such as SCSI or
SATA) that is stateful after a rejected bio submit. Then it's just a matter of
enough concurrent I/O.

So a workaround is to just disable blk-mq with SCSI: scsi_mod.use_blk_mq=1


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (266 preceding siblings ...)
  2018-12-05  9:33 ` bugzilla-daemon
@ 2018-12-05  9:51 ` bugzilla-daemon
  2018-12-05  9:56 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  9:51 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #266 from Rafał Miłecki (zajec5@gmail.com) ---
> So a workaround is to just disable blk-mq with SCSI: scsi_mod.use_blk_mq=1

I wasn't sure if that's a reliable workaround as it was being discussed before
the fix was provided. Thank you for clarifying that!


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (267 preceding siblings ...)
  2018-12-05  9:51 ` bugzilla-daemon
@ 2018-12-05  9:56 ` bugzilla-daemon
  2018-12-05 10:03 ` bugzilla-daemon
  2018-12-05 10:17 ` [Bug 201685] Incorrect disk writes caused by blk-mq lead to " bugzilla-daemon
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05  9:56 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #267 from Ortwin Glück (odi@odi.ch) ---
Of course that should be a zero to disable it: scsi_mod.use_blk_mq=0
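
To verify that the parameter actually reached the kernel after a reboot, one
option is to inspect the boot command line. Below is a minimal Python sketch.
The function names are made up for illustration, and two points are
assumptions on my side: that y/Y/1 and n/N/0 are the accepted boolean
spellings, and that the last occurrence on the command line wins.

```python
def scsi_blk_mq_from_cmdline(cmdline):
    """Return True/False if scsi_mod.use_blk_mq is set on the command line, else None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if not sep or key != "scsi_mod.use_blk_mq":
            continue
        if val in ("1", "y", "Y"):
            value = True
        elif val in ("0", "n", "N"):
            value = False
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    setting = scsi_blk_mq_from_cmdline(read_cmdline())
    if setting is None:
        print("scsi_mod.use_blk_mq not set on the command line (kernel default applies)")
    else:
        print("scsi_mod.use_blk_mq =", int(setting))
```

A result of None only means the parameter was not passed explicitly. In that
case the built-in default (CONFIG_SCSI_MQ_DEFAULT, as discussed earlier in this
thread) decides.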


* [Bug 201685] ext4 file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (268 preceding siblings ...)
  2018-12-05  9:56 ` bugzilla-daemon
@ 2018-12-05 10:03 ` bugzilla-daemon
  2018-12-05 10:17 ` [Bug 201685] Incorrect disk writes caused by blk-mq lead to " bugzilla-daemon
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05 10:03 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

--- Comment #268 from Marc Burkhardt (marc@osknowledge.org) ---
(In reply to Rainer Fiebig from comment #263)
> (In reply to Guenter Roeck from comment #240)
> > As mentioned earlier, I only ever saw the problem on two of four systems
> > (see #57), all running the same kernel and the same version of Ubuntu. The
> > only differences are mainboard, CPU, and attached drive types.
> > 
> > I don't think we know for sure what it takes to trigger the problem. We
> have
> > seen various guesses, from gcc version to l1tf mitigation to CPU type,
> > broken hard drives, and whatnot. At this time evidence points to the block
> > subsystem, with bisect pointing to a commit which relies on the state of
> the
> > HW queue (empty or not) in conjunction with the 'none' io scheduler. This
> > may suggest that drive speed and access timing may be involved. That guess
> > may of course be just as wrong as all the others.
> > 
> > Let's just hope that Jens will be able to track down and fix the problem.
> > Then we may be able to get a better idea what it actually takes to trigger
> > it.
> 
> It would indeed be nice to get a short summary *here* of what happened and
> why, once the dust has settled.
> 
> It would also be interesting to know why all the testing in the run-up to
> 4.19 didn't catch it, including rc-kernels. It's imo for instance unlikely
> that everybody just tested with CONFIG_SCSI_MQ_DEFAULT=n.

As mentioned earlier:

it would be nice to have a definitive list of circumstances that are likely to
trigger the bug, so people can check whether they are probably affected
because they _ran_ their systems with these settings and possibly have garbage
on their disks now...


* [Bug 201685] Incorrect disk writes caused by blk-mq lead to file system corruption
  2018-11-13 19:42 [Bug 201685] New: ext4 file system corruption bugzilla-daemon
                   ` (269 preceding siblings ...)
  2018-12-05 10:03 ` bugzilla-daemon
@ 2018-12-05 10:17 ` bugzilla-daemon
  270 siblings, 0 replies; 272+ messages in thread
From: bugzilla-daemon @ 2018-12-05 10:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Artem S. Tashkinov (aros@gmx.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|ext4                        |Block Layer
     Kernel Version|maybe one of 4.18.18 4.19.1 |4.19.x 4.20-rc
                   |4.20-rc2                    |
           Assignee|fs_ext4@kernel-bugs.osdl.or |axboe@kernel.dk
                   |g                           |
            Product|File System                 |IO/Storage
         Regression|No                          |Yes
            Summary|ext4 file system corruption |Incorrect disk writes
                   |                            |caused by blk-mq lead to
                   |                            |file system corruption

