* [Bug 102731] New: I have a cough.
@ 2015-08-12  8:47 bugzilla-daemon
  2015-08-12  8:56 ` [Bug 102731] " bugzilla-daemon
                   ` (45 more replies)
  0 siblings, 46 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-12  8:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

            Bug ID: 102731
           Summary: I have a cough.
           Product: File System
           Version: 2.5
    Kernel Version: 3.16
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: ext4
          Assignee: fs_ext4@kernel-bugs.osdl.org
          Reporter: john@calva.com
        Regression: No

This is a bug opened to talk about the symptoms that led up to my opening bug

https://bugzilla.kernel.org/show_bug.cgi?id=89621

as requested by Theodore Tso

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
@ 2015-08-12  8:56 ` bugzilla-daemon
  2015-08-12  9:02 ` bugzilla-daemon
                   ` (44 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-12  8:56 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #1 from John Hughes <john@calva.com> ---
The hardware:

A couple of Dell PowerEdge 840s (Xeon 3050, 4G).

Disks: internal SATA, plus SCSI in external cabinets.

Host OS Debian 7.7 kernel 3.16, version  3.16.5-1~bpo70+1

Disks assembled into RAID1 arrays then passed to kvm guest as virtio.

Guest OS Debian 7.7, kernel either as host, or now 3.18.19

Guest uses LVM2

All filesystems on guest are ext3, now handled by ext4 subsystem.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
  2015-08-12  8:56 ` [Bug 102731] " bugzilla-daemon
@ 2015-08-12  9:02 ` bugzilla-daemon
  2015-08-12  9:11 ` bugzilla-daemon
                   ` (43 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-12  9:02 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #2 from John Hughes <john@calva.com> ---
host uptime:

 10:59:22 up 24 days, 14:21,  2 users,  load average: 0.39, 0.51, 0.90

(Last reboot caused by power cut)

PCI config on host:

00:00.0 Host bridge: Intel Corporation E7230/3000/3010 Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation E7230/3000/3010 PCI Express Root Port
00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #1 (rev 01)
00:1d.1 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #2 (rev 01)
00:1d.2 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #3 (rev 01)
00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation NM10/ICH7 Family SMBus Controller (rev 01)
01:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
01:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09)
02:08.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
03:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
04:01.0 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
04:01.1 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
04:02.0 Ethernet controller: Intel Corporation 82543GC Gigabit Ethernet Controller (Copper) (rev 02)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
07:00.0 Ethernet controller: Altima (nee Broadcom) AC9100 Gigabit Ethernet (rev 15)
07:05.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI ES1000 (rev 02)


Disk configuration on host:

[0:0:0:0]    cd/dvd  LITE-ON  CD-ROM LTN-4891S NDS3  /dev/sr0 
[2:0:0:0]    disk    HITACHI  HUS151473VL3800  S110  /dev/sda 
[2:0:1:0]    disk    FUJITSU  MAP3735NC        5605  /dev/sdb 
[2:0:2:0]    disk    FUJITSU  MAP3735NC        5605  /dev/sdc 
[2:0:3:0]    disk    FUJITSU  MAP3735NC        5605  /dev/sdd 
[2:0:8:0]    disk    QUANTUM  ATLAS10K2-TY367J DA40  /dev/sde 
[2:0:10:0]   disk    QUANTUM  ATLAS10K2-TY367J DA40  /dev/sdf 
[3:0:0:0]    disk    IBM      IC35L036UCDY10-0 S27E  /dev/sdg 
[3:0:1:0]    disk    QUANTUM  ATLAS10K2-TY367J DA40  /dev/sdh 
[3:0:2:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdi 
[3:0:3:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdj 
[3:0:4:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdk 
[3:0:5:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdl 
[3:0:9:0]    disk    IBM      DMVS18M          0220  /dev/sdm 
[3:0:10:0]   disk    IBM      DMVS18D          02B0  /dev/sdn 
[3:0:11:0]   disk    IBM      IC35L036UCDY10-0 S27F  /dev/sdo 
[3:0:12:0]   disk    SEAGATE  ST318305LC       2203  /dev/sdp 
[3:0:13:0]   disk    FUJITSU  MAX3073NC        0104  /dev/sdq 
[3:0:15:0]   process Dell     12 BAY U2W CU    0209  -        
[4:0:0:0]    disk    ATA      ST3808110AS      J     /dev/sds 
[4:0:1:0]    disk    ATA      ST3808110AS      J     /dev/sdr 
[4:0:2:0]    disk    ATA      ST3808110AS      J     /dev/sdt 
[4:0:3:0]    disk    ATA      ST3808110AS      J     /dev/sdu 

mdadm config on host:

Personalities : [raid1] 
md122 : active raid1 sda1[2] sdj1[3]
      71680902 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md123 : active raid1 sdb1[0] sdk1[1]
      71616384 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md124 : active raid1 sdg1[3] sdf1[2]
      35549071 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md125 : active raid1 sdh1[2] sde1[3]
      35549071 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdc1[0] sdi1[2]
      71680902 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active raid1 sdd[0] sdl[1]
      71621696 blocks super 1.2 [2/2] [UU]

md2 : active (auto-read-only) raid1 sdt1[2] sdu1[1]
      78121912 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sds2[0] sdr2[1]
      77623224 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md0 : active raid1 sds1[0] sdr1[1]
      498676 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

Last dmesg messages on host:

[1225118.952197] md: data-check of RAID array md0
[1225118.952291] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225118.952384] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225118.952528] md: using 128k window, over a total of 498676k.
[1225119.166383] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[1225119.178563] md: data-check of RAID array md122
[1225119.178644] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.178735] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.178877] md: using 128k window, over a total of 71680902k.
[1225119.192249] md: data-check of RAID array md123
[1225119.192331] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.192422] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.192564] md: using 128k window, over a total of 71616384k.
[1225119.205378] md: data-check of RAID array md124
[1225119.205461] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.210851] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.216548] md: using 128k window, over a total of 35549071k.
[1225119.230698] md: data-check of RAID array md125
[1225119.236498] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.242171] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.247721] md: using 128k window, over a total of 35549071k.
[1225119.265408] md: data-check of RAID array md126
[1225119.270962] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.276552] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.282158] md: using 128k window, over a total of 71680902k.
[1225119.296781] md: data-check of RAID array md127
[1225119.302834] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.309046] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.315202] md: using 128k window, over a total of 71621696k.
[1225127.248170] md: md0: data-check done.
[1225127.284640] md: data-check of RAID array md1
[1225127.290534] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225127.296453] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225127.302521] md: using 128k window, over a total of 77623224k.
[1226450.960973] md: md1: data-check done.
[1226756.371905] md: md125: data-check done.
[1227131.051785] md: md127: data-check done.
[1227141.431898] md: md124: data-check done.
[1227808.276946] md: md126: data-check done.
[1227847.417636] md: md123: data-check done.
[1227978.744149] md: md122: data-check done.

Configuration of guest:

<domain type='kvm' id='4'>
  <name>olympic</name>
  <uuid>b42a5958-4258-1e0b-cbd0-05a0768c250b</uuid>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-1.1'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc' adjustment='reset'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70a'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70b'/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:34c'/>
      <target dev='vdc' bus='virtio'/>
      <alias name='virtio-disk2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:34d'/>
      <target dev='vdd' bus='virtio'/>
      <alias name='virtio-disk3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70e'/>
      <target dev='vde' bus='virtio'/>
      <alias name='virtio-disk4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70f'/>
      <target dev='vdf' bus='virtio'/>
      <alias name='virtio-disk5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='00:16:3e:4b:86:33'/>
      <source bridge='calvaedi'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none'/>
</domain>


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
  2015-08-12  8:56 ` [Bug 102731] " bugzilla-daemon
  2015-08-12  9:02 ` bugzilla-daemon
@ 2015-08-12  9:11 ` bugzilla-daemon
  2015-08-12  9:12 ` bugzilla-daemon
                   ` (42 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-12  9:11 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #3 from John Hughes <john@calva.com> ---
What is the guest doing?

It's an NFSv4 server for my home directories.

It's my email server (sendmail and cyrus imap).

I've seen the problem (bug 89621) on the root filesystem, the home filesystem
and the imap filesystem.

Most often on the root, a little less frequently the homedir filesystem.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (2 preceding siblings ...)
  2015-08-12  9:11 ` bugzilla-daemon
@ 2015-08-12  9:12 ` bugzilla-daemon
  2015-08-12 18:53 ` bugzilla-daemon
                   ` (41 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-12  9:12 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

John Hughes <john@calva.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Regression|No                          |Yes


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (3 preceding siblings ...)
  2015-08-12  9:12 ` bugzilla-daemon
@ 2015-08-12 18:53 ` bugzilla-daemon
  2015-08-12 19:25 ` bugzilla-daemon
                   ` (40 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-12 18:53 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

Theodore Tso <tytso@mit.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@mit.edu

--- Comment #4 from Theodore Tso <tytso@mit.edu> ---
Can you upload your kernel .config file so I can try to replicate your
environment as closely as possible?

-- Ted


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (4 preceding siblings ...)
  2015-08-12 18:53 ` bugzilla-daemon
@ 2015-08-12 19:25 ` bugzilla-daemon
  2015-08-31 15:46 ` bugzilla-daemon
                   ` (39 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-12 19:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #5 from John Hughes <john@calva.com> ---
Created attachment 184741
  --> https://bugzilla.kernel.org/attachment.cgi?id=184741&action=edit
config for kernel 3.18.19


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (5 preceding siblings ...)
  2015-08-12 19:25 ` bugzilla-daemon
@ 2015-08-31 15:46 ` bugzilla-daemon
  2015-08-31 15:47 ` bugzilla-daemon
                   ` (38 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-31 15:46 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #6 from John Hughes <john@calva.com> ---
Well, it's just happened again.

[1728664.535798] EXT4-fs (dm-2): pa ffff88001ce2d368: logic 512, phys. 19774464, len 512
[1728664.538530] EXT4-fs error (device dm-2): ext4_mb_release_inode_pa:3773: group 603, free 449, pa_free 448
[1728664.541378] Aborting journal on device dm-2-8.
[1728664.549493] EXT4-fs (dm-2): Remounting filesystem read-only
[1728664.551281] EXT4-fs error (device dm-2) in ext4_reserve_inode_write:4775: Journal has aborted
[1728664.590678] EXT4-fs error (device dm-2): ext4_journal_check_start:56: Detected aborted journal
[1728664.599749] EXT4-fs error (device dm-2) in ext4_orphan_del:2688: Journal has aborted
[1728664.612118] EXT4-fs error (device dm-2) in ext4_reserve_inode_write:4775: Journal has aborted

I'll attach the fsck output.
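
For anyone comparing several of these reports, the counters in the ext4_mb_release_inode_pa message can be extracted mechanically. The following is a minimal sketch in Python, not kernel tooling: the helper name parse_pa_mismatch, the regular expression and the "delta" field are all made up for illustration. It assumes the message layout seen in the log above, and it collapses whitespace first because mail clients tend to wrap long log lines. As far as I understand the message, "free" and "pa_free" are two counts of free blocks that are expected to agree, so the sketch simply reports their difference.

```python
import re

# Hypothetical helper, not kernel tooling: pulls the counters out of an
# "ext4_mb_release_inode_pa" error message like the one quoted above.
PA_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"ext4_mb_release_inode_pa:\d+: "
    r"group (?P<group>\d+), free (?P<free>\d+), pa_free (?P<pa_free>\d+)"
)


def parse_pa_mismatch(text):
    """Return the counters from the first matching error, or None."""
    # Long log lines are often wrapped in mail, so collapse whitespace first.
    flat = " ".join(text.split())
    match = PA_ERROR.search(flat)
    if match is None:
        return None
    free = int(match.group("free"))
    pa_free = int(match.group("pa_free"))
    return {
        "device": match.group("device"),
        "group": int(match.group("group")),
        "free": free,
        "pa_free": pa_free,
        # Illustrative field: how far apart the two counts are.
        "delta": free - pa_free,
    }


# The error line from this report, wrapped the way the mail shows it.
sample = """[1728664.538530] EXT4-fs error (device dm-2): ext4_mb_release_inode_pa:3773:
group 603, free 449, pa_free 448"""

print(parse_pa_mismatch(sample))
```

For the message above this reports group 603 on dm-2 with the two counts differing by one block.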


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (6 preceding siblings ...)
  2015-08-31 15:46 ` bugzilla-daemon
@ 2015-08-31 15:47 ` bugzilla-daemon
  2015-08-31 18:03 ` bugzilla-daemon
                   ` (37 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-31 15:47 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #7 from John Hughes <john@calva.com> ---
Created attachment 186381
  --> https://bugzilla.kernel.org/attachment.cgi?id=186381&action=edit
fsck output after reboot.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (7 preceding siblings ...)
  2015-08-31 15:47 ` bugzilla-daemon
@ 2015-08-31 18:03 ` bugzilla-daemon
  2015-09-01 10:28 ` bugzilla-daemon
                   ` (36 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-08-31 18:03 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #8 from Theodore Tso <tytso@mit.edu> ---
How many guest VMs are you running, and how often does the file system
corruption happen?   It sounds like it's about once every 2-3 weeks?

Would you be willing to experiment with running the VM without the RAID1, so
that we bypass the MD layer?   The common factors among the people reporting
this are that (a) they are using KVM, and (b) they are using RAID.  Now, I'm
doing all of my testing using KVM, and I've never seen anything like this.  So
I wonder if the operative issue is happening at the MD layer....

Your workload is common enough that, if this were happening to everyone who
uses ext4, I would expect to be getting a lot more bug reports.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (8 preceding siblings ...)
  2015-08-31 18:03 ` bugzilla-daemon
@ 2015-09-01 10:28 ` bugzilla-daemon
  2015-09-01 14:43 ` bugzilla-daemon
                   ` (35 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-09-01 10:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #9 from John Hughes <john@calva.com> ---
> How many guest VMs are you running

One at the moment.

> how often does the file system corruption happen?   It sounds like it's about once every 2-3 weeks?

With the Debian 3.16 kernel it was happening about once every 4 days.

I installed my 3.18.19 kernel on 24/7/2015 and have seen the bug twice:
11/8/2015 and 31/8/2015.
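
As a quick check on the "once every 2-3 weeks" estimate, the gaps between these dates can be worked out directly. This is only a sketch: it reads the dates as day/month/year, measures the first gap from the day the 3.18.19 kernel was installed, and measures the second from the previous failure. The helper name gaps_in_days is made up for illustration.

```python
from datetime import date

# Dates from this comment, read as day/month/year.
installed = date(2015, 7, 24)
failures = [date(2015, 8, 11), date(2015, 8, 31)]


def gaps_in_days(start, events):
    """Days from start to the first event, then between consecutive events."""
    gaps = []
    previous = start
    for event in events:
        gaps.append((event - previous).days)
        previous = event
    return gaps


print(gaps_in_days(installed, failures))  # [18, 20]
```

So on 3.18.19 the two failures came after 18 and 20 days, against roughly 4 days on the Debian 3.16 kernel.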

> Would you be willing to experiment running the VM without using the RAID1 so we're bypassing the MD layer?

I guess I'll have to try that.  It's going to be a monumental pain in the arse:

The guest's /home is a 160GiB LVM volume striped across 4 mdadm RAID1 arrays.

What I can try first is just disabling one side of the mdadm mirrors; if the
problem recurs I can get rid of mdadm completely.

(This will be the first time I've run one of my servers without disk mirroring
since 1986.  Ugh.)


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (9 preceding siblings ...)
  2015-09-01 10:28 ` bugzilla-daemon
@ 2015-09-01 14:43 ` bugzilla-daemon
  2015-09-01 16:08 ` bugzilla-daemon
                   ` (34 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-09-01 14:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #10 from cwseys@physics.wisc.edu ---
If you're getting failures faster with kernel 3.16, you might also get failures
faster by backtracking through
3.15.x
3.14.x
etc. towards 3.2

until you have to wait long enough to be confident you've passed the bug.

That would also mean less surgery on Linux RAID.

Chad.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (10 preceding siblings ...)
  2015-09-01 14:43 ` bugzilla-daemon
@ 2015-09-01 16:08 ` bugzilla-daemon
  2015-09-16 14:09 ` bugzilla-daemon
                   ` (33 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-09-01 16:08 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #11 from John Hughes <john@calva.com> ---
On 01/09/15 16:43, bugzilla-daemon@bugzilla.kernel.org wrote:
> If you're getting failures faster in kernel 3.16 you might also get failures
> faster by backtracking
> 3.15.x
> 3.14.x
> etc towards 3.2
Well, since a similar bug is known to be present before 3.16.something 
that doesn't seem like a good idea.

But you're right about one thing: since the bug happens faster with the 
Debian 3.16 kernel than with my 3.18.19, I might try running that one 
without the RAID.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (11 preceding siblings ...)
  2015-09-01 16:08 ` bugzilla-daemon
@ 2015-09-16 14:09 ` bugzilla-daemon
  2015-09-28 17:06 ` bugzilla-daemon
                   ` (32 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-09-16 14:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #12 from John Hughes <john@calva.com> ---
Ok, as of a few minutes ago I am running the Debian 3.16 kernel again, with all
the mdadm devices unmirrored.

More news if I see the problem again.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (12 preceding siblings ...)
  2015-09-16 14:09 ` bugzilla-daemon
@ 2015-09-28 17:06 ` bugzilla-daemon
  2015-09-30  9:49 ` bugzilla-daemon
                   ` (31 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-09-28 17:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #13 from Theodore Tso <tytso@mit.edu> ---
So it's been 12 days, and previously when you were using the Debian 3.16
kernel, it was triggering once every four days, right?  Can I assume that your
silence indicates that you haven't seen a problem to date?

If so, then it really does seem that it might be an interaction between LVM/MD
and KVM.

So if that's the case, then the next thing to ask is to try to figure out what
might be the triggering cause.   A couple of things come to mind:

1) Some failure to properly handle a flush cache command being sent to the MD
device.  This, combined with either a power failure or a crash of the guest OS
(depending on how KVM is configured), might explain a block update getting
lost.   The fact that the block bitmap is out of sync with the block group
descriptor is consistent with this failure.  However, if you were seeing
failures once every four days, that would imply that the guest OS and/or host
OS would be crashing at that or about that level of frequency, and you haven't
reported that. 

2) Some kind of race between a 4k write and a RAID1 resync leading to a block
write getting lost.  Again, this reported data corruption is consistent with
this theory --- but this also requires the guest OS crashing due to some kind
of kernel crash or KVM/qemu shutdown and/or host OS crash / power failure, as
in (1) above.  If you weren't seeing these failures once every four days or so,
then this isn't a likely explanation.

3)  Some kind of corruption caused by the TRIM command being sent to the
RAID/MD device, possibly racing with a block bitmap update.  This could be
caused either by the file system being mounted with the -o discard mount
option, or by fstrim getting run out of cron, or by e2fsck explicitly being
asked to discard unused blocks (with the "-E discard" option).

4)  Some kind of bug which happens rarely either in qemu, the host kernel or
the guest kernel depending on how it communicates with the virtual disk. 
(i.e., virtio, scsi, ide, etc.)   Virtio is the most likely use case, and so
trying to change to use scsi emulation might be interesting.  (OTOH, if the
problem is specific to the MD layer, then this possibility is less likely.)

So as far as #3 is concerned, can you check to see if you had fstrim enabled,
or are mounting the file system with -o discard?
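
One way to answer the "-o discard" half of that question is to look at the options column of /proc/mounts on the guest. The sketch below does that in Python. The helper name discard_mounts is made up for illustration, and the sample text is invented, not taken from the reporter's system. It only covers the mount option; it cannot tell whether fstrim is run from cron or whether e2fsck was ever given "-E discard".

```python
def discard_mounts(mounts_text):
    """Return (device, mountpoint) pairs for ext3/ext4 mounts using 'discard'.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in ("ext3", "ext4") and "discard" in options.split(","):
            hits.append((device, mountpoint))
    return hits


# Invented sample in /proc/mounts format, for illustration only.
sample_mounts = """\
/dev/mapper/vg-root / ext4 rw,relatime,errors=remount-ro 0 0
/dev/mapper/vg-home /home ext4 rw,relatime,discard 0 0
tmpfs /run tmpfs rw,nosuid,noexec 0 0
"""

print(discard_mounts(sample_mounts))
```

On a real system the same function could be fed the contents of /proc/mounts, for example via open("/proc/mounts").read(). An empty list would mean no ext3/ext4 filesystem is currently mounted with discard.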


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (13 preceding siblings ...)
  2015-09-28 17:06 ` bugzilla-daemon
@ 2015-09-30  9:49 ` bugzilla-daemon
  2015-10-07 16:17 ` bugzilla-daemon
                   ` (30 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-09-30  9:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #14 from John Hughes <john@calva.com> ---
On 28/09/15 19:06, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=102731
>
> --- Comment #13 from Theodore Tso <tytso@mit.edu> ---
> So it's been 12 days, and previously when you were using the Debian 3.16
> kernel, it was triggering once every four days, right?  Can I assume that your
> silence indicates that you haven't seen a problem to date?

I haven't seen the problem, but unfortunately I'm running 3.18.19 at the 
moment (I screwed up on the last boot and let it boot the default 
kernel).  I haven't had time to reboot.  So I'd like to give it a bit 
more time.
>
> If so, then it really does seem that it might be an interaction between LVM/MD
> and KVM.
>
> So if that's the case, then the next thing to ask is to try to figure out what
> might be the triggering cause.   A couple of things come to mind:
>
> 1) Some failure to properly handle a flush cache command being sent to the MD
> device.  This, combined with either a power failure or a crash of the guest OS
> (depending on how KVM is configured), might explain a block update getting
> lost.   The fact that the block bitmap is out of sync with the block group
> descriptor is consistent with this failure.  However, if you were seeing
> failures once every four days, that would imply that the guest OS and/or host
> OS would be crashing at that or about that level of frequency, and you haven't
> reported that.

I haven't had any host or guest crashes.

>
> 2) Some kind of race between a 4k write and a RAID1 resync leading to a block
> write getting lost.  Again, this reported data corruption is consistent with
> this theory --- but this also requires the guest OS crashing due to some kind
> of kernel crash or KVM/qemu shutdown and/or host OS crash / power failure, as
> in (1) above.  If you weren't seeing these failures once every four days or so,
> then this isn't a likely explanation.

No crashes.

>
> 3)  Some kind of corruption caused by the TRIM command being sent to the
> RAID/MD device, possibly racing with a block bitmap update.  This could be
> caused either by the file system being mounted with the -o discard mount
> option, or by fstrim getting run out of cron, or by e2fsck explicitly being
> asked to discard unused blocks (with the "-E discard" option).

I'm not using "-o discard" or fstrim, and I've never used the "-E discard" 
option to fsck.
>
> 4)  Some kind of bug which happens rarely either in qemu, the host kernel or
> the guest kernel depending on how it communicates with the virtual disk.
> (i.e., virtio, scsi, ide, etc.)   Virtio is the most likely use case, and so
> trying to change to use scsi emulation might be interesting.  (OTOH, if the
> problem is specific to the MD layer, then this possibility is less likely.)
>
> So as far as #3 is concerned, can you check to see if you had fstrim enabled,
> or are mounting the file system with -o discard?
>

I'm a bit overwhelmed with work at the moment, so I haven't had time to 
read this message with the care it deserves. I'll get back to you with 
more detail next week.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (14 preceding siblings ...)
  2015-09-30  9:49 ` bugzilla-daemon
@ 2015-10-07 16:17 ` bugzilla-daemon
  2015-10-08  9:16 ` bugzilla-daemon
                   ` (29 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-07 16:17 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #15 from John Hughes <john@calva.com> ---
On 28/09/15 19:06, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=102731
>
> --- Comment #13 from Theodore Tso <tytso@mit.edu> ---
> So it's been 12 days, and previously when you were using the Debian 3.16
> kernel, it was triggering once every four days, right?  Can I assume that your
> silence indicates that you haven't seen a problem to date?

As I said, due to a silly mistake (I rebooted the system and forgot 
that the default kernel was still the 3.18.19 one) I was not running 
the Debian kernel.

And, after 19 days the problem has shown up again:

Oct  7 17:34:24 olympic kernel: [1657429.788105] EXT4-fs (dm-2): pa
ffff880004211a50: logic 512, phys. 10298368, len 512
Oct  7 17:34:24 olympic kernel: [1657429.790412] EXT4-fs error (device dm-2):
ext4_mb_release_inode_pa:3773: group 314, free 497, pa_free 495
Oct  7 17:34:24 olympic kernel: [1657429.793168] Aborting journal on device
dm-2-8.
Oct  7 17:34:24 olympic kernel: [1657429.795367] EXT4-fs (dm-2): Remounting
filesystem read-only


The filesystem where it happened, "dm-2", is an LVM volume:

# lvdisplay -m /dev/olympic/olympic-home
   --- Logical volume ---
   LV Path                /dev/olympic/olympic-home
   LV Name                olympic-home
   VG Name                olympic
   LV UUID                drA6nQ-zbcu-SDLc-UeXH-foML-fPcB-ve1HQf
   LV Write Access        read/write
   LV Creation host, time ,
   LV Status              available
   # open                 1
   LV Size                160.00 GiB
   Current LE             40960
   Segments               3
   Allocation             inherit
   Read ahead sectors     auto
   - currently set to     512
   Block device           253:2

   --- Segments ---
   Logical extent 0 to 34999:
     Type        striped
     Stripes        2
     Stripe size        64.00 KiB
     Stripe 0:
       Physical volume    /dev/vdf
       Physical extents    0 to 17499
     Stripe 1:
       Physical volume    /dev/vdb
       Physical extents    0 to 17499

   Logical extent 35000 to 40861:
     Type        striped
     Stripes        2
     Stripe size        64.00 KiB
     Stripe 0:
       Physical volume    /dev/vdd
       Physical extents    4339 to 7269
     Stripe 1:
       Physical volume    /dev/vdc
       Physical extents    4388 to 7318

   Logical extent 40862 to 40959:
     Type        striped
     Stripes        2
     Stripe size        64.00 KiB
     Stripe 0:
       Physical volume    /dev/vdd
       Physical extents    7270 to 7318
     Stripe 1:
       Physical volume    /dev/vdc
       Physical extents    4339 to 4387
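
As a rough cross-check of the log against this layout, here is a minimal
Python sketch that maps the "phys. 10298368" block from the pa message to a
block group and to an LVM stripe.  It assumes 4 KiB filesystem blocks, 32768
blocks per group (the mke2fs default for 4 KiB blocks), a first data block of
0, and the usual round-robin chunk layout for striped segments.  None of that
is stated in the thread, and the names locate and SEGMENTS are made up for
illustration.

```python
# Assumptions (not confirmed in the thread): 4 KiB blocks, 32768 blocks per
# group, filesystem starting at offset 0 of the logical volume.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768
EXTENT_SIZE = 160 * 2**30 // 40960   # LV Size / Current LE = 4 MiB
STRIPE_SIZE = 64 * 1024              # "Stripe size 64.00 KiB"

# (first LE, last LE, [(physical volume, first PE), ...]) from lvdisplay -m
SEGMENTS = [
    (0, 34999, [("/dev/vdf", 0), ("/dev/vdb", 0)]),
    (35000, 40861, [("/dev/vdd", 4339), ("/dev/vdc", 4388)]),
    (40862, 40959, [("/dev/vdd", 7270), ("/dev/vdc", 4339)]),
]


def locate(fs_block):
    """Map a filesystem block number to its block group, LE, PV and PE."""
    byte_off = fs_block * BLOCK_SIZE
    le = byte_off // EXTENT_SIZE
    for first, last, stripes in SEGMENTS:
        if first <= le <= last:
            # Offset inside this segment, split into 64 KiB chunks that
            # alternate between the stripes.
            seg_off = byte_off - first * EXTENT_SIZE
            chunk, within = divmod(seg_off, STRIPE_SIZE)
            pv, first_pe = stripes[chunk % len(stripes)]
            pv_off = (chunk // len(stripes)) * STRIPE_SIZE + within
            return {
                "group": fs_block // BLOCKS_PER_GROUP,
                "le": le,
                "pv": pv,
                "pe": first_pe + pv_off // EXTENT_SIZE,
            }
    raise ValueError("block is outside the logical volume")


print(locate(10298368))
```

Under those assumptions the block falls in group 314, which agrees with the
group number in the EXT4-fs error line, and lands on /dev/vdf.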

The devices, vdf, vdb, vdd, vdc are all unmirrored:

    <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:70b'/>
       <target dev='vdb' bus='virtio'/>
       <alias name='virtio-disk1'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
function='0x0'/>
     </disk>
     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:34c'/>
       <target dev='vdc' bus='virtio'/>
       <alias name='virtio-disk2'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x06'
function='0x0'/>
     </disk>
     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:34d'/>
       <target dev='vdd' bus='virtio'/>
       <alias name='virtio-disk3'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
function='0x0'/>
     </disk>
     ...
     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:70f'/>
       <target dev='vdf' bus='virtio'/>
       <alias name='virtio-disk5'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x09'
function='0x0'/>
     </disk>

lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:34c ->
../../md125
lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:34d ->
../../md124
lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:70b ->
../../md126
lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:70f ->
../../md122

md122 : active raid1 sda1[2] sdj1[3](F)
       71680902 blocks super 1.2 [2/1] [U_]
--
md124 : active raid1 sdg1[3] sdf1[2](F)
       35549071 blocks super 1.2 [2/1] [U_]
--
md125 : active raid1 sdh1[2] sde1[3](F)
       35549071 blocks super 1.2 [2/1] [U_]
--
md126 : active raid1 sdc1[0] sdi1[2](F)
       71680902 blocks super 1.2 [2/1] [U_]


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (15 preceding siblings ...)
  2015-10-07 16:17 ` bugzilla-daemon
@ 2015-10-08  9:16 ` bugzilla-daemon
  2015-10-11  4:05 ` bugzilla-daemon
                   ` (28 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-08  9:16 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #16 from John Hughes <john@calva.com> ---
Minor detail:

The ext4 driver decided the filesystem was corrupt on 7/10 (Oct 7), but on 4/10
(Oct 4) I'd run an online check on it and it was OK (took an LVM snapshot,
fsck'd that).

Oct  4 02:44:27 olympic online-check: .IN: Online fileystem check
/dev/olympic/olympic-home
Oct  4 02:52:00 olympic online-check: .NO: Volume /dev/olympic/olympic-home
checked OK
Oct  4 02:52:01 olympic online-check: .IN: Snapshot
/dev/olympic/olympic-home-snap used 0.10%


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (16 preceding siblings ...)
  2015-10-08  9:16 ` bugzilla-daemon
@ 2015-10-11  4:05 ` bugzilla-daemon
  2015-10-12 10:36 ` bugzilla-daemon
                   ` (27 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-11  4:05 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #17 from Theodore Tso <tytso@mit.edu> ---
I just realized something which I forgot to ask you --- what file system are
you using on the *host*?  You said that on the guest you are using ext3 file
systems with the ext4 driver --- but what about on the host OS side?

I will say that using a full ext4 file system is far more stable on 3.18.21
than using an ext3 file system in compatibility mode.   Ext4 with the 4.2
kernel will support ext3 with zero unexpected test failures, but there are a
handful of test failures that are showing up with the 3.18.21 kernel.  I've
backported a few bug fixes, but they haven't shown up in the stable kernel
series yet, and there are still a half-dozen test failures that I haven't had
time to characterize yet.  (Note: support for long-term stable kernels is
something I do as a very low priority background task.  My priority is upstream
regressions and upstream development, and ideally someone would help with
testing the stable kernels and identifying patches that require manual
backports, but I haven't found a sucker yet.)

One of the reasons why I ask is that the PUNCH hole functionality was
relatively new in the 3.18 kernel, and KVM uses it --- and I am suspicious that
there might be some bug fixes that didn't land in the 3.18 kernel.   So one
thing that might be worth trying is to get a 4.2.3 kernel for both your Host
and Guest kernels, and see what that does for you.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (17 preceding siblings ...)
  2015-10-11  4:05 ` bugzilla-daemon
@ 2015-10-12 10:36 ` bugzilla-daemon
  2015-10-12 14:01 ` bugzilla-daemon
                   ` (26 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-12 10:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #18 from John Hughes <john@calva.com> ---
> I just realized something which I forgot to ask you --- what file system are
> you using on the *host*?  You said that on the guest you are using ext3 file
> systems with the ext4 driver --- but what about on the host OS side?

The volumes passed to the guest are just mdadm volumes, no fs from the 
point of view of the host.

The host filesystems are all ext3 (on lvm, on mdadm) , the host runs the 
debian 3.16 backport kernel, but it does very little filesystem I/O.

> I will say that using a full ext4 file system is far more stable on 3.18.21
> than using an ext3 file system in compatibility mode.
What's the magic command to convert an ext3 fs to ext4?  The wisdom of 
the net seems to say:

tune2fs -O extents,uninit_bg,dir_index /dev/xxx
fsck.ext4 -yfD /dev/xxx

Or would it be better to make a new FS and copy the files?
> One of the reasons why I ask is that the PUNCH hole functionality was
> relatively new in the 3.18 kernel, and KVM uses it --- and I am suspicious that
> there might be some bug fixes that didn't land in the 3.18 kernel.
Was it used in 3.16?  I first saw the problem with Debian's 3.16-based 
kernel.

>     So one
> thing that might be worth trying is to get a 4.2.3 kernel for both your Host
> and Guest kernels, and see what that does for you.

Ok, I'll investigate that.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (18 preceding siblings ...)
  2015-10-12 10:36 ` bugzilla-daemon
@ 2015-10-12 14:01 ` bugzilla-daemon
  2015-10-15 15:32 ` bugzilla-daemon
                   ` (25 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-12 14:01 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #19 from cwseys@physics.wisc.edu ---
Might be interesting to convert the filesystem to XFS via a series of partition 
adds / copies / removes .  At least then the ext4 connection could be more 
firmly established.

(Actually, might want to "convert" to ext4 instead of XFS first and verify that 
the problem does not go away.)

Chad.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (19 preceding siblings ...)
  2015-10-12 14:01 ` bugzilla-daemon
@ 2015-10-15 15:32 ` bugzilla-daemon
  2015-10-15 15:38 ` bugzilla-daemon
                   ` (24 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-15 15:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #20 from John Hughes <john@calva.com> ---
So yesterday it happened again:

Oct 14 18:08:48 olympic kernel: [605450.518092] EXT4-fs error (device dm-2):
ext4_mb_generate_buddy:757: group 434, block bitmap and bg descriptor
inconsistent: 1834 vs 1833 free clusters
Oct 14 18:08:48 olympic kernel: [605450.522151] Aborting journal on device
dm-2-8.
Oct 14 18:08:48 olympic kernel: [605450.524218] EXT4-fs (dm-2): Remounting
filesystem read-only
Oct 14 18:08:48 olympic kernel: [605450.572244] EXT4-fs error (device dm-2) in
ext4_reserve_inode_write:4775: Journal has aborted
Oct 14 18:08:48 olympic kernel: [605450.577125] EXT4-fs error (device dm-2) in
ext4_orphan_del:2688: Journal has aborted
Oct 14 18:08:48 olympic kernel: [605450.583858] EXT4-fs error (device dm-2) in
ext4_reserve_inode_write:4775: Journal has aborted


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (20 preceding siblings ...)
  2015-10-15 15:32 ` bugzilla-daemon
@ 2015-10-15 15:38 ` bugzilla-daemon
  2015-10-15 15:41 ` bugzilla-daemon
                   ` (23 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-15 15:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #21 from John Hughes <john@calva.com> ---
(In reply to cwseys from comment #19)
> Might be interesting to convert the filesystem to XFS via a series of
> partition adds / copies / removes .  At least then the ext4 connection
> could be more firmly established.

Well, since I've demirrored the system (ugh) I have enough space to make
copies of my filesystems.

> (Actually, might want to "convert" to ext4 instead of XFS first and verify
> that the problem does not go away.)

I'm planning to move to ext4 over the weekend.  I've not yet decided whether to
do it in-place or by copying the data to a new filesystem.

Will doing an in place conversion be enough?


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (21 preceding siblings ...)
  2015-10-15 15:38 ` bugzilla-daemon
@ 2015-10-15 15:41 ` bugzilla-daemon
  2015-10-16 13:04 ` bugzilla-daemon
                   ` (22 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-15 15:41 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #22 from John Hughes <john@calva.com> ---
But on reflection maybe it's better to move to 4.2 as Theodore says:

"Ext4 with the 4.2 kernel will support ext3 with zero unexpected test failures,
but there are a handful of test failures that are showing up with the 3.18.21
kernel."

Decisions, decisions.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (22 preceding siblings ...)
  2015-10-15 15:41 ` bugzilla-daemon
@ 2015-10-16 13:04 ` bugzilla-daemon
  2015-10-16 15:53 ` bugzilla-daemon
                   ` (21 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-16 13:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #23 from John Hughes <john@calva.com> ---
So I've booted the guest with kernel 4.2.3:

Linux olympic 4.2.3-jh1 #2 SMP Thu Oct 15 19:04:09 CEST 2015 x86_64 GNU/Linux

I'm leaving the host on Debian's 3.16.5-1~bpo70+1 kernel for the moment.

If I see the problem again I'll move the host to 4.2.3.

If I see the problem with host and guest at 4.2.3 I'll convert the filesystems
(in place) to ext4.

If I see the problems with in-place converted ext4 I'll rebuild the filesystems
as native ext4 and copy all the files over.

Seems like a good plan?


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (23 preceding siblings ...)
  2015-10-16 13:04 ` bugzilla-daemon
@ 2015-10-16 15:53 ` bugzilla-daemon
  2015-10-16 16:14 ` bugzilla-daemon
                   ` (20 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-16 15:53 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #24 from cwseys@physics.wisc.edu ---
Hi John,
    Thanks for working on this!
    I've attempted to send debian infrastructure an email asking if they've 
noticed this problem.  If not, I'd like to find the difference.  (Maybe they 
don't use ext4 in VMs.)

> If I see the problem with host and guest at 4.2.3 I'll convert the
> filesystems (in place) to ext4.
> 
> If I see the problems with in-place converted ext4 I'll rebuild the
> filesystems as native ext4 and copy all the files over.

    I am using filesystems formatted as ext4 ("native ext4") and seem to see 
the same problem.  But it will be interesting to hear your results over a 
couple of months. :)

Chad.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (24 preceding siblings ...)
  2015-10-16 15:53 ` bugzilla-daemon
@ 2015-10-16 16:14 ` bugzilla-daemon
  2015-10-20 13:40 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-16 16:14 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #25 from cwseys@physics.wisc.edu ---
P.S.
There is a bug report for Ubuntu VMs going read-only that has similar backtraces. 

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1423672

There are some with "ext4_mb_generate_buddy" (NOT what John originally 
reported) and some with "ext4_mb_release_inode_pa" (John's original bug 
report).  They seem to be mutually exclusive - if one appears the other does 
not.

The ones with "ext4_mb_release_inode_pa" show up in kernel 3.13.? (Ubuntu's 
3.13.0-54-generic).

Theo said of backtraces containing "ext4_mb_generate_buddy" - "I believe the 
bug in question was fixed in a backport that first showed up in v3.16.2"

C.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (25 preceding siblings ...)
  2015-10-16 16:14 ` bugzilla-daemon
@ 2015-10-20 13:40 ` bugzilla-daemon
  2015-10-20 15:44 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-20 13:40 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #26 from John Hughes <john@calva.com> ---
Well I've just had the problem happen when running the 4.2.3 kernel (on the
guest, the host is still 3.16).

This time it was on the root filesystem, so no log messages unfortunately.

Will try and schedule some time to move the host to 4.2.3 this week.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (26 preceding siblings ...)
  2015-10-20 13:40 ` bugzilla-daemon
@ 2015-10-20 15:44 ` bugzilla-daemon
  2015-10-20 15:55 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-20 15:44 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #27 from John Hughes <john@calva.com> ---
And now I have had (with 4.2.3) an error on my /home filesystem:

Oct 20 16:06:58 olympic kernel: [ 8027.714881] EXT4-fs error (device dm-2):
ext4_mb_generate_buddy:758: group 519, block bitmap and bg descriptor
inconsistent: 22589 vs 22588 free clusters
Oct 20 16:06:58 olympic kernel: [ 8027.718806] Aborting journal on device
dm-2-8.
Oct 20 16:06:58 olympic kernel: [ 8027.722387] EXT4-fs (dm-2): Remounting
filesystem read-only
Oct 20 16:06:58 olympic kernel: [ 8027.724291] EXT4-fs error (device dm-2) in
ext4_free_blocks:4889: Journal has aborted
Oct 20 16:06:58 olympic kernel: [ 8027.727369] EXT4-fs error (device dm-2) in
ext4_do_update_inode:4504: Journal has aborted
Oct 20 16:06:58 olympic kernel: [ 8027.735809] EXT4-fs error (device dm-2) in
ext4_truncate:3789: IO failure
Oct 20 16:06:58 olympic kernel: [ 8027.739795] EXT4-fs error (device dm-2) in
ext4_orphan_del:2895: Journal has aborted
Oct 20 16:06:58 olympic kernel: [ 8027.743834] EXT4-fs error (device dm-2) in
ext4_do_update_inode:4504: Journal has aborted

Note that this is the ext4_mb_generate_buddy error cwseys mentioned.  AARGH!

When I ran fsck only one error was detected: "block bitmap differences
+17016320"
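
For what it's worth, the fsck result and the kernel message can be checked
against each other.  The following is a minimal Python sketch that pulls the
group number and the two free-cluster counts out of the
ext4_mb_generate_buddy line and compares the group with the block that fsck
reported.  It assumes 32768 blocks per group (the mke2fs default for 4 KiB
blocks, not stated in the thread); PATTERN and parse are illustrative names.

```python
import re

# Assumption: 4 KiB blocks with the default 32768 blocks per group.
BLOCKS_PER_GROUP = 32768

# The wrapped syslog line from above, joined back into one string.
LOG = ("EXT4-fs error (device dm-2): ext4_mb_generate_buddy:758: group 519, "
       "block bitmap and bg descriptor inconsistent: "
       "22589 vs 22588 free clusters")

PATTERN = re.compile(
    r"EXT4-fs error \(device (?P<dev>[^)]+)\): "
    r"ext4_mb_generate_buddy:\d+: group (?P<group>\d+), "
    r"block bitmap and bg descriptor inconsistent: "
    r"(?P<first>\d+) vs (?P<second>\d+) free clusters"
)


def parse(line):
    """Return device, group and count difference, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "dev": m.group("dev"),
        "group": int(m.group("group")),
        "delta": int(m.group("first")) - int(m.group("second")),
    }


info = parse(LOG)
fsck_block = 17016320  # from "block bitmap differences +17016320"
print(info, fsck_block // BLOCKS_PER_GROUP)
```

With those assumptions the single block reported by fsck sits in group 519,
the same group the kernel complained about, and the two counts differ by
exactly one cluster.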

As the system was down I've moved the host to 4.2.3 (the users were already
screaming, so...)


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (27 preceding siblings ...)
  2015-10-20 15:44 ` bugzilla-daemon
@ 2015-10-20 15:55 ` bugzilla-daemon
  2015-10-20 16:28 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-20 15:55 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #28 from John Hughes <john@calva.com> ---
I've just noticed that the system had pretty low memory -- only 1G.  I've upped
it to 2G.  Don't know whether that will make a difference.  I see some
allocation failures, mostly relating to apache and tcp.

Oct 17 05:26:10 olympic kernel: [53046.469494] apache2: page allocation
failure: order:1, mode:0x204020
Oct 17 05:26:10 olympic kernel: [53046.469812] apache2: page allocation
failure: order:1, mode:0x204020
Oct 17 05:26:10 olympic kernel: [53046.469985] apache2: page allocation
failure: order:1, mode:0x204020
Oct 17 05:26:10 olympic kernel: [53046.470154] apache2: page allocation
failure: order:1, mode:0x204020
Oct 19 22:43:46 olympic kernel: [288101.737120] apache2: page allocation
failure: order:1, mode:0x204020
Oct 19 22:43:46 olympic kernel: [288101.737833] kswapd0: page allocation
failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.431636] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.431952] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.631774] swapper/0: page allocation
failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.635743] swapper/0: page allocation
failure: order:1, mode:0x204020
Oct 20 07:03:32 olympic kernel: [318088.158197] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 07:03:32 olympic kernel: [318088.160930] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 07:03:32 olympic kernel: [318088.163835] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 10:44:54 olympic kernel: [331369.850853] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 10:44:54 olympic kernel: [331369.851186] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 10:44:54 olympic kernel: [331369.851357] apache2: page allocation
failure: order:1, mode:0x204020
Oct 20 10:54:04 olympic kernel: [331920.651891] apache2: page allocation
failure: order:1, mode:0x204020


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (28 preceding siblings ...)
  2015-10-20 15:55 ` bugzilla-daemon
@ 2015-10-20 16:28 ` bugzilla-daemon
  2015-10-20 16:30 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-20 16:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #29 from cwseys@physics.wisc.edu ---
If what we're seeing is the same thing...

8 of 10 VMs with <=1GB which have gone read-only.
0 of 2 VMs with >1GB where the problem has never occurred.

(This is an estimate:  May have missed some which have not gone read-only.)
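
To get a feel for how much weight such small counts can carry, here is a
minimal Python sketch of a one-sided Fisher exact test.  It reads the numbers
above as "8 of 10 small-memory VMs went read-only, 0 of 2 larger VMs did",
which is an interpretation, and the function name is made up for
illustration.

```python
from math import comb


def fisher_one_sided(fail_small, ok_small, fail_big, ok_big):
    """P(the big-memory group has <= fail_big failures) if memory is irrelevant.

    Uses the hypergeometric distribution with all table margins held fixed.
    """
    n_big = fail_big + ok_big
    total_fail = fail_small + fail_big
    total = fail_small + ok_small + fail_big + ok_big
    favourable = sum(
        comb(total_fail, k) * comb(total - total_fail, n_big - k)
        for k in range(fail_big + 1)
    )
    return favourable / comb(total, n_big)


p = fisher_one_sided(fail_small=8, ok_small=2, fail_big=0, ok_big=2)
print(round(p, 3))
```

With only two VMs in the larger-memory group the probability comes out at
roughly 0.09, so the counts hint at a memory connection but do not settle it.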

So let's see what happens!
C.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (29 preceding siblings ...)
  2015-10-20 16:28 ` bugzilla-daemon
@ 2015-10-20 16:30 ` bugzilla-daemon
  2015-11-25 10:09 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-10-20 16:30 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #30 from cwseys@physics.wisc.edu ---
But, FWIW awhile back I had a test VM with 1GB RAM and ran 'stress' on the VM 
for a couple of weeks with no problems.

C.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (30 preceding siblings ...)
  2015-10-20 16:30 ` bugzilla-daemon
@ 2015-11-25 10:09 ` bugzilla-daemon
  2016-01-19 12:00 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2015-11-25 10:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #31 from John Hughes <john@calva.com> ---
Ok, it's now been 35 days since the last problem.

I'm currently running:

Host: 4.2.3
Guest: 4.2.3

Unfortunately when I moved the host from 3.16 to 4.2.3 I also changed the
amount of memory available for the guest, poor style for debugging but I was
getting a lot of flak from disgruntled users.

I will now try moving the guest back to 3.16 with 1G of ram to see whether the
problem was solved by changing the host kernel or by changing the guest memory
size.

More news as it becomes available.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (31 preceding siblings ...)
  2015-11-25 10:09 ` bugzilla-daemon
@ 2016-01-19 12:00 ` bugzilla-daemon
  2016-01-21 23:57 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2016-01-19 12:00 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #32 from John Hughes <john@calva.com> ---
Well I've (finally) rebooted with guest kernel 3.16 and 1 GB of memory.  It took a
long time to convince the users to risk a (potentially) unstable system after a
long time of all working OK.

For the record -- the problem went away when I moved the host kernel to 4.2.3
and increased the memory of the guest system to 2 GB.  Changing the guest kernel
from 3.16 to 4.2.3 *did not* make the problem go away.

Possible results --

1. problem does not show up, implies the fix is moving the host kernel to 4.2.3

2. problem does show up, implies the fix is either:

   a. increasing the guest memory to 2 GB, or

   b. moving the host to 4.2.3 *and* increasing the guest memory to 2 GB

   c. moving the guest to 4.2.3 *and* increasing the guest memory to 2 GB

   d. moving the host to 4.2.3 *and* moving the guest to 4.2.3

   e. moving the host to 4.2.3 *and* moving the guest to 4.2.3 *and* increasing
      the guest memory to 2 GB.
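
The case analysis above can be reproduced mechanically.  Here is a minimal
Python sketch that treats each candidate fix as a set of changes that must
all be present for the filesystem to stay healthy, and keeps the candidates
that agree with the runs reported so far.  The factor labels, the RUNS list
and the "all required changes present" model are simplifications chosen for
illustration.

```python
from itertools import combinations

FACTORS = ("host 4.2.3", "guest 4.2.3", "guest memory 2G")

# (changes present in the run, did the filesystem stay healthy?)
RUNS = [
    (frozenset(), False),                 # host 3.16, guest 3.16, 1G
    (frozenset({"guest 4.2.3"}), False),  # host 3.16, guest 4.2.3, 1G
    (frozenset(FACTORS), True),           # host 4.2.3, guest 4.2.3, 2G
]


def consistent(required, runs):
    """A candidate fits if a run was healthy exactly when it had every
    required change."""
    return all((required <= present) == healthy for present, healthy in runs)


def surviving(runs):
    """All non-empty sets of changes that agree with every run."""
    return [
        frozenset(c)
        for r in range(1, len(FACTORS) + 1)
        for c in combinations(FACTORS, r)
        if consistent(frozenset(c), runs)
    ]


# The run now in progress: host 4.2.3, guest 3.16, 1G.
pending = frozenset({"host 4.2.3"})
if_healthy = surviving(RUNS + [(pending, True)])
if_broken = surviving(RUNS + [(pending, False)])
print(len(surviving(RUNS)), if_healthy, len(if_broken))
```

Six candidates fit the runs so far.  If the pending run stays healthy only
the host kernel change remains (case 1), and if it breaks the other five
remain (cases 2a to 2e).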

More news as it happens.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (32 preceding siblings ...)
  2016-01-19 12:00 ` bugzilla-daemon
@ 2016-01-21 23:57 ` bugzilla-daemon
  2016-01-22 10:27 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2016-01-21 23:57 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #33 from cwseys@physics.wisc.edu ---
Hello,
    Could you summarize your findings?  It is hard to keep track after so much 
time.  :)
    Here is a beginning:

Hversion  Gversion  Gmem  read-only?   uptime
3.16      3.16      ????  yes          ~14 days
4.2.3     4.2.3     2 GB  no           ~60 days
4.2.3     3.16      ? GB  in progress


Thanks!
Chad.


* [Bug 102731] I have a cough.
  2015-08-12  8:47 [Bug 102731] New: I have a cough bugzilla-daemon
                   ` (33 preceding siblings ...)
  2016-01-21 23:57 ` bugzilla-daemon
@ 2016-01-22 10:27 ` bugzilla-daemon
  2016-01-22 15:20 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  45 siblings, 0 replies; 47+ messages in thread
From: bugzilla-daemon @ 2016-01-22 10:27 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #34 from John Hughes <john@calva.com> ---
I first started seeing the problem when I moved the guest kernel to 3.16 
(i.e. started using the ext4 driver for my ext3 filesystems).

1. Host 3.16, Guest 3.16, 1G memory, filesystem goes readonly after 
around 4 days

Then I upgraded the guest to 3.18.19

2. Host 3.16 Guest 3.18.19, 1G memory, filesystem goes readonly after 
around 20 days

Then I upgraded the guest to 4.3.2

3. Host 3.16 Guest 4.3.2, 1G memory, filesystem goes readonly after 4 days

Then I upgraded the host to 4.3.2, noticed the guest had only 1G of 
memory (and the guest kernel was bitching about not having enough memory 
during some TCP operations) so I increased the guest memory to 2G,  Bad 
debugging style, but I was desperate.

4. Host 4.3.2, Guest 4.3.2, 2G memory, system ran for 90 days with no 
problem.

So now I've set the guest to 3.16 with 1G of memory.  We'll see what 
happens next.


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-01-22 15:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #35 from cwseys@physics.wisc.edu ---
Thanks!
I think you mistyped 4.2.3 -> 4.3.2 .

Here's the table (helps me see any pattern b/c it is closer together.  Less 
memory required. :) )

Hversion    Gversion  Gmem  time before read-only
3.16        3.16      1 GB  ~4 days
3.16        3.18.19   1 GB  ~20 days
3.16        4.2.3     1 GB  4 days
4.2.3       4.2.3     2 GB  not after ~90 days
4.2.3       3.16      1 GB  inprogress
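
As a rough aid to spotting the pattern, the rows above can be tallied per column. This is only a sketch: `RUNS` and `failures_by` are names made up for illustration, and the outcomes are transcribed by hand from the table.

```python
from collections import defaultdict

# Rows transcribed from the table above:
# (host kernel, guest kernel, guest memory in GB, outcome)
RUNS = [
    ("3.16", "3.16", 1, "read-only after ~4 days"),
    ("3.16", "3.18.19", 1, "read-only after ~20 days"),
    ("3.16", "4.2.3", 1, "read-only after 4 days"),
    ("4.2.3", "4.2.3", 2, "no failure after ~90 days"),
    ("4.2.3", "3.16", 1, "in progress"),
]


def failures_by(column):
    """Return {value: (failed_runs, total_runs)} for one column of RUNS."""
    counts = defaultdict(lambda: [0, 0])
    for row in RUNS:
        key = row[column]
        counts[key][1] += 1
        if row[3].startswith("read-only"):
            counts[key][0] += 1
    return {key: tuple(value) for key, value in counts.items()}


by_host = failures_by(0)
by_guest = failures_by(1)
by_mem = failures_by(2)

for label, tally in (("host", by_host), ("guest", by_guest), ("mem", by_mem)):
    print(label, tally)
```

With these five runs every failure sits on host 3.16, while the failures are spread across all three guest versions. Memory is not yet separated from host version, since only one run used 2 GB and one run is still in progress.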


> Then I upgraded the host to 4.3.2, noticed the guest had only 1G of
> memory (and the guest kernel was bitching about not having enough memory
> during some TCP operations) so I increased the guest memory to 2G,  Bad
> debugging style, but I was desperate.

Debugging and production don't get along well together. :)

> So now I've set the guest to 3.16 with 1G of memory.  We'll see what
> happens next.

I haven't tried changing the host's kernel version yet. Will be interested to 
know how it turns out!  Thanks for debugging and hopefully you'll help us 
(Debian/Ubuntu users) find a workaround at least.

Thanks again,
Chad.


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-01-22 16:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #36 from John Hughes <john@calva.com> ---
On 22/01/16 16:20, bugzilla-daemon@bugzilla.kernel.org wrote:
> I think you mistyped 4.2.3 -> 4.3.2 .
Duh, yes.

# uname -a
Linux baltic 4.2.3-jh1 #2 SMP Thu Oct 15 19:04:09 CEST 2015 x86_64 GNU/Linux


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-02-08  9:52 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

andre.arnold@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andre.arnold@gmail.com

--- Comment #37 from andre.arnold@gmail.com ---
Hey all together,

thanks for the information provided, it helped me a lot fixing my VM crashes in
KVM.

After downgrading the Gversion from 3.16 the crashes were gone.

Chad, what's your experience with your latest test? Running Hversion 4.2.3 and
Gversion on 3.16?

Thanks in advance.

Cheers
André


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-02-08 10:56 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #38 from John Hughes <john@calva.com> ---
My current status is:

Host: 4.2.3
Guest: 3.16.7

Guest uptime: 20 days.

I'm going to leave it like this for some more time, but my current 
feeling is that the ext4 subsystem in a KVM guest is only stable if the 
host kernel is > 4.something.  The ext3 subsystem did not seem to be 
vulnerable to whatever the host was doing wrong before 4.whatever.


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-03-18 22:20 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

Vaclav Ovsik <vaclav.ovsik@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vaclav.ovsik@gmail.com

--- Comment #39 from Vaclav Ovsik <vaclav.ovsik@gmail.com> ---
Hello,
please take a look at
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502
IMO the problem is in the 3.16 hypervisor on older Intel CPUs.
Currently I'm testing hypervisor 4.1 and it seems to be OK.
-- 
Zito


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-03-19 17:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #40 from Vaclav Ovsik <vaclav.ovsik@gmail.com> ---
The issue attributes should be changed. This is not about the filesystem in the
VM. A colleague had a failure with btrfs in a VM too...


-------- Forwarded Message --------
Subject:     [logcheck] ttrss.home.nahorany.net 2016-03-11 15:00 +0100 Security
Events
Date:     Fri, 11 Mar 2016 15:00:16 +0100 (CET)
From:     logcheck system account <logcheck@home.nahorany.net>
To:     centrino@nahorany.net



Security Events for kernel
=-=-=-=-=-=-=-=-=-=-=-=-=-
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2195232
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2198280
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2973336
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 3351896
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2198280
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2973336
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 3351896

System Events
=-=-=-=-=-=-=
Mar 11 14:41:36 ttrss sshd[23130]: Received disconnect from 172.16.0.22: 11:
disconnected by user
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2195232
Mar 11 14:23:32 ttrss ansible-command: Invoked with warn=True executable=None
chdir=None _raw_params=apt-get update removes=None creates=None
_uses_shell=False
Mar 11 14:25:14 ttrss ansible-command: Invoked with warn=True executable=None
chdir=None _raw_params=apt-get upgrade -y removes=None creates=None
_uses_shell=False
Mar 11 14:39:19 ttrss kernel: [2816286.989244] device-mapper: uevent: version
1.0.3
Mar 11 14:39:19 ttrss kernel: [2816287.638040] device-mapper: ioctl:
4.27.0-ioctl (2013-10-30) initialised: dm-devel@redhat.com
Mar 11 14:39:19 ttrss kernel: [2816286.989244] device-mapper: uevent: version
1.0.3
Mar 11 14:39:19 ttrss kernel: [2816287.638040] device-mapper: ioctl:
4.27.0-ioctl (2013-10-30) initialised: dm-devel@redhat.com
Mar 11 14:39:26 ttrss kernel: [2816295.076906] SGI XFS with ACLs, security
attributes, realtime, large block/inode numbers, no debug enabled
Mar 11 14:39:26 ttrss kernel: [2816295.076906] SGI XFS with ACLs, security
attributes, realtime, large block/inode numbers, no debug enabled
Mar 11 14:39:27 ttrss kernel: [2816295.173689] JFS: nTxBlock = 1938, nTxLock =
15510
Mar 11 14:39:27 ttrss kernel: [2816295.414255] ntfs: driver 2.1.30 [Flags: R/W
MODULE].
Mar 11 14:39:27 ttrss kernel: [2816295.721879] QNX4 filesystem 0.2.3
registered.
Mar 11 14:39:27 ttrss kernel: [2816295.173689] JFS: nTxBlock = 1938, nTxLock =
15510
Mar 11 14:39:27 ttrss kernel: [2816295.414255] ntfs: driver 2.1.30 [Flags: R/W
MODULE].
Mar 11 14:39:27 ttrss kernel: [2816295.721879] QNX4 filesystem 0.2.3
registered.
Mar 11 14:39:28 ttrss kernel: [2816296.129309] fuse init (API version 7.23)
Mar 11 14:39:28 ttrss kernel: [2816296.129309] fuse init (API version 7.23)
Mar 11 14:39:28 ttrss os-prober: debug: /dev/vda1: is active swap
Mar 11 14:39:29 ttrss ansible-command: Invoked with warn=True executable=None
chdir=None _raw_params=apt-get -y upgrade removes=None creates=None
_uses_shell=False
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
1, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
2, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
3, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
4, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
1, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
2, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
3, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
4, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:45:30 ttrss ansible-command: Invoked with warn=True executable=None
chdir=None _raw_params=apt-get -y upgrade removes=None creates=None
_uses_shell=False
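
For anyone comparing logs like the one above, a small sketch of pulling the failing sectors out of mail-wrapped kernel messages may help. The `SAMPLE` text is a short excerpt of the log above; `IO_ERROR` and `failed_sectors` are illustrative names, and the pattern only covers the `end_request: I/O error` wording seen here.

```python
import re
from collections import Counter

# Short excerpt of the forwarded log, including the mail client's line wraps.
SAMPLE = """\
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2195232
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda,
sector 2198280
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr
1, rd 0, flush 0, corrupt 0, gen 0
"""

IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failed_sectors(log_text):
    """Count I/O errors per (device, sector), tolerating wrapped lines."""
    # Collapse all whitespace so a message split across two lines matches.
    flat = " ".join(log_text.split())
    return Counter((dev, int(sector)) for dev, sector in IO_ERROR.findall(flat))


sectors = failed_sectors(SAMPLE)
for (dev, sector), count in sorted(sectors.items()):
    print(dev, sector, count)
```

Counting per sector also makes the duplicated log lines visible, since a sector reported twice shows up with a count of 2.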

-- 
Zito


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-03-20  1:27 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #41 from Theodore Tso <tytso@mit.edu> ---
I'm not sure we need to leave this bug open at all.  It sounds like it's some
combination of a buggy Intel CPU and buggy kernel code in the
KVM/virtualization support.   The fix seems to be to update to a newer host
kernel (4.1 or newer).

This is problematic for Debian users, but at this point it's a bug for the
Debian kernel maintainers, and this bugzilla is for the upstream kernel....


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-03-20 23:26 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #42 from Vaclav Ovsik <vaclav.ovsik@gmail.com> ---
Yes, you are right. I found a commit
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424
using git bisect. I hope this commit can be used for the Debian kernel package.
Thanks
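
The idea behind searching for a fixing commit this way can be sketched as a plain binary search. This is not the actual procedure used for the bisect mentioned above: the commit names, `first_fixed`, and `probe` are hypothetical, and the sketch assumes a linear history in which the bug is present at the start, absent at the end, and changes state exactly once.

```python
def first_fixed(commits, is_fixed):
    """Return the earliest commit for which is_fixed() is true.

    Assumes commits are ordered oldest to newest, the first one is broken,
    the last one is fixed, and the state flips exactly once in between.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is broken, commits[hi] is fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # the fix is at mid or earlier
        else:
            lo = mid  # the fix is after mid
    return commits[hi]


# Hypothetical linear history; "c5" stands in for the fixing commit.
history = ["c%d" % i for i in range(9)]
tested = []


def probe(commit):
    """Stand-in for building a kernel and running the reproducer."""
    tested.append(commit)
    return int(commit[1:]) >= 5


result = first_fixed(history, probe)
print(result, tested)
```

Each probe halves the remaining range, so nine candidate commits need only three builds here. That matters when every test means booting a host kernel and waiting for a guest to fail.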


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-03-21 13:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #43 from Theodore Tso <tytso@mit.edu> ---
Vaclav, I want to really, really thank you for doing the bisect.   Looking at
the testing procedure you used for the git bisect here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502

I'm quite confident you've found the commit that needs to be backported into
the Debian 3.16 kernel.  Given that the commit was added in 3.17, it should be
fairly easy to backport it into the Debian kernel.
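
A minimal sketch of flagging host kernels that predate the upstream fix follows, assuming only the version number is known. `FIX_RELEASE`, `kernel_tuple`, and `predates_fix` are illustrative names. A distribution kernel that carries a backport would still be reported as old by this check, so it is a hint rather than proof.

```python
import re

FIX_RELEASE = (3, 17)  # per the thread, the fix landed upstream in 3.17


def kernel_tuple(release):
    """Parse the leading numeric part of a release string into a 3-tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing patch level (as in "3.17") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def predates_fix(release):
    """True if the major.minor version is older than FIX_RELEASE."""
    return kernel_tuple(release)[:2] < FIX_RELEASE


# Version strings quoted earlier in this thread.
for release in ("3.16.5-1~bpo70+1", "4.2.3-jh1"):
    print(release, predates_fix(release))
```

On a live host the string returned by `platform.release()` could be passed in instead of the literals.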

It's really too bad this commit hadn't been marked with a
cc:stable@vger.kernel.org tag, but to be fair I've sometimes made this mistake,
either out of sheer forgetfulness or because I didn't recognize the seriousness
of the bugs that a commit could address.   To be fair to the KVM maintainers,
the commit description doesn't really list the potential impacts of the bug it
was fixing.

Also with 20-20 hindsight, it's perhaps unfortunate that during this time
period there was a real divergence[1] of kernels that were used by
distributions.  So a bug that would only show up on certain generations of
Intel chipsets, and only when used in virtualization as a host, is precisely
the sort of bug that is not likely to be noticed until it goes into
enterprise-wide deployment --- and so the fact that other distributions didn't
standardize on a single kernel in this time period (and Debian standardized on
the oldest kernel in this rough time period, 3.16, and the bug in question was
fixed in 3.17) meant that it's not all that surprising that this slipped
through.   And while it would have been more convenient if Debian had been
willing to switch over to a 3.18-based stable series, it wasn't compatible with
their release schedule.

[1] Ubuntu 14.04 LTS stayed on 3.16 for only 18 months before moving to 3.19; 
Fedora 20 was on 3.11, and Fedora 21 jumped to 3.17, etc.

Vaclav, thanks again for finding a simple/easy repro, and then bisecting until
you found the commit that is needed for the 3.16 Debian kernel!


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-03-25 16:55 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #44 from cwseys@physics.wisc.edu ---
Any ideas why ext4 in guests with kernel version 3.2.x (Wheezy) did not go 
read-only, but guests with 3.16.x (Jessie) did?

I suppose running a vmhost with kernel 3.16 and bisecting the guest's kernel 
from 3.2 to 3.16 would give us this answer.

But I'm not sure how interesting that would be, given that the root cause is
KVM not simulating a real machine correctly.

C.


* [Bug 102731] I have a cough.
From: bugzilla-daemon @ 2016-04-08 15:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=102731

--- Comment #45 from John Hughes <john@calva.com> ---
Sorry for my lack of reactivity, I've been a bit ill (had a terrible cough,
actually).

I've been running with a 4.2.3 kernel in the host since 19/1/2016 (around 80
days) with no problems.

Yesterday my host was rebooted (power cut) and ran with the 3.16 kernel for one
day and I saw the problem again.

Based on that, and Vaclav's superb research I am convinced that the problem is
in the host system, probably fixed by the patch he found.

