* 3.8.7: general protection fault
@ 2013-05-02 14:45 Bernd Schubert
  2013-05-06  8:14 ` 3.9.0: " Bernd Schubert
  0 siblings, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2013-05-02 14:45 UTC (permalink / raw)
  To: linux-xfs

I just got this issue on one of my test servers:

> [784650.537576] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> [784650.540018] Modules linked in: fhgfs(O) fhgfs_client_opentk(O) nfsd xfs ext4 mbcache jbd2 crc16 mlx4_ib mlx4_core ib_umad rdma_ucm rdma_cm iw_cm ib_addr ib_uverbs sg ib_ipoib ib_cm ib_sa loop arcmsr dm_mod md_mod evdev snd_pcm snd_timer snd soundcore tpm_tis snd_page_alloc tpm psmouse tpm_bios pcspkr shpchp serio_raw pci_hotplug ib_mthca ib_mad parport_pc parport amd64_edac_mod ib_core i2c_nforce2 processor k8temp edac_core thermal_sys edac_mce_amd button i2c_core ehci_pci fuse sd_mod crc_t10dif btrfs zlib_deflate crc32c libcrc32c ata_generic e1000 pata_amd sata_nv ohci_hcd floppy ehci_hcd libata scsi_mod usbcore usb_common
> [784650.540018] CPU 1
> [784650.540018] Pid: 247, comm: kworker/1:1H Tainted: G           O 3.8.7 #34 Supermicro H8DCE/H8DCE
> [784650.540018] RIP: 0010:[<ffffffffa05b38ba>]  [<ffffffffa05b38ba>] xfs_trans_ail_delete_bulk+0x7a/0x1d0 [xfs]
> [784650.540018] RSP: 0018:ffff8801f29ddbf8  EFLAGS: 00010202
> [784650.540018] RAX: 0000000000000001 RBX: ffff8801f319bf00 RCX: 0000000000000000
> [784650.540018] RDX: ffff8801f29ddc68 RSI: 6b6b6b6b6b6b6b6b RDI: 6b6b6b6b6b6b6b6b
> [784650.540018] RBP: ffff8801f29ddc48 R08: 0000000000000001 R09: 0000000000000000
> [784650.540018] R10: 0000000000000001 R11: ffffffffa05ccb35 R12: 0000000000000002
> [784650.540018] R13: 0000000000000008 R14: ffff8801f319bf10 R15: ffff8801f7951550
> [784650.540018] FS:  00007f10717f4700(0000) GS:ffff8801ff600000(0000) knlGS:0000000000000000
> [784650.540018] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [784650.540018] CR2: 00007fe7eed59000 CR3: 00000000ae5e4000 CR4: 00000000000007e0
> [784650.540018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [784650.540018] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [784650.540018] Process kworker/1:1H (pid: 247, threadinfo ffff8801f29dc000, task ffff8801f2f14980)
> [784650.540018] Stack:
> [784650.540018]  ffff880100000000 ffff880100000000 ffff8801f29ddc60 ffff88005ff27838
> [784650.540018]  ffff8801f29ddc48 ffff88014c4c0b20 ffff8801f319bf00 ffff8801f29ddc78
> [784650.540018]  ffff8801f29ddc60 ffff8801f29ddd58 ffff8801f29ddca8 ffffffffa05b3320
> [784650.540018] Call Trace:
> [784650.540018]  [<ffffffffa05b3320>] xfs_iflush_done+0x1c0/0x1f0 [xfs]
> [784650.540018]  [<ffffffffa05b3174>] ? xfs_iflush_done+0x14/0x1f0 [xfs]
> [784650.540018]  [<ffffffffa05b0d8c>] xfs_buf_do_callbacks+0x3c/0x50 [xfs]
> [784650.540018]  [<ffffffffa05b145e>] xfs_buf_iodone_callbacks+0x3e/0x250 [xfs]
> [784650.540018]  [<ffffffffa0554309>] xfs_buf_iodone_work+0x59/0xa0 [xfs]
> [784650.540018]  [<ffffffff8107b744>] process_one_work+0x204/0x550
> [784650.540018]  [<ffffffff8107b6d0>] ? process_one_work+0x190/0x550
> [784650.540018]  [<ffffffff8107ddeb>] worker_thread+0x12b/0x3d0
> [784650.540018]  [<ffffffff8107dcc0>] ? manage_workers+0x2f0/0x2f0
> [784650.540018]  [<ffffffff810836ce>] kthread+0xee/0x100
> [784650.540018]  [<ffffffff810835e0>] ? __init_kthread_worker+0x70/0x70
> [784650.540018]  [<ffffffff8158dabc>] ret_from_fork+0x7c/0xb0
> [784650.540018]  [<ffffffff810835e0>] ? __init_kthread_worker+0x70/0x70
> [784650.540018] Code: 00 00 48 89 f2 31 c0 31 c9 eb 19 66 0f 1f 44 00 00 4c 8b 7a 08 48 83 c2 08 41 f6 47 34 01 0f 84 ad 00 00 00 49 8b 3f 49 8b 77 08 <48> 89 77 08 48 89 3e 48 bf 00 01 10 00 00 00 ad de 48 be 00 02
> [784650.540018] RIP  [<ffffffffa05b38ba>] xfs_trans_ail_delete_bulk+0x7a/0x1d0 [xfs]
> [784650.540018]  RSP <ffff8801f29ddbf8>
> [784650.849642] ---[ end trace 38cae66ea9b6d0f5 ]---
> [784650.854439] BUG: sleeping function called from invalid context at kernel/rwsem.c:20
> [784650.862280] in_atomic(): 1, irqs_disabled(): 0, pid: 247, name: kworker/1:1H

[... further messages and traces leading up to a panic skipped]

I can resolve line numbers if required, but I don't have a minute to 
look into it myself right now.
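
(For the record, resolving the faulting address to a source line should be 
roughly this, assuming the xfs module was built with debug info; untested 
here:)

    # map RIP offset xfs_trans_ail_delete_bulk+0x7a to a source line
    gdb -batch -ex 'list *(xfs_trans_ail_delete_bulk+0x7a)' \
        /lib/modules/$(uname -r)/kernel/fs/xfs/xfs.ko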


Cheers,
Bernd






_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* 3.9.0: general protection fault
  2013-05-02 14:45 3.8.7: general protection fault Bernd Schubert
@ 2013-05-06  8:14 ` Bernd Schubert
  2013-05-06  9:40   ` Bernd Schubert
  2013-05-06 12:28   ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Bernd Schubert @ 2013-05-06  8:14 UTC (permalink / raw)
  To: linux-xfs

And another protection fault, this time with 3.9.0. It always happens on 
one of the servers. It's ECC memory, so I don't suspect a faulty memory 
bank. Going to fsck now.


> [303340.514052] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> [303340.517913] Modules linked in: fhgfs(O) fhgfs_client_opentk(O) nfsd xfs ext4 mbcache jbd2 crc16 sg mlx4_ib mlx4_core ib_umad rdma_ucm rdma_cm iw_cm ib_addr ib_uverbs ib_ipoib ib_cm ib_sa loop arcmsr dm_mod md_mod evdev tpm_tis tpm tpm_bios psmouse serio_raw shpchp pcspkr pci_hotplug amd64_edac_mod edac_core edac_mce_amd k8temp ib_mthca ib_mad ib_core processor ehci_pci thermal_sys button i2c_nforce2 i2c_core fuse btrfs sd_mod crc_t10dif raid6_pq xor zlib_deflate crc32c libcrc32c ata_generic pata_amd sata_nv ohci_hcd libata e1000 ehci_hcd scsi_mod floppy
> [303340.532909] CPU 1
> [303340.532909] Pid: 256, comm: kworker/1:1H Tainted: G           O 3.9.0-debug+ #10 Supermicro H8DCE/H8DCE
> [303340.532909] RIP: 0010:[<ffffffff812d45d4>]  [<ffffffff812d45d4>] __list_del_entry+0x76/0xd4
> [303340.532909] RSP: 0018:ffff8801f502dbb8  EFLAGS: 00010a83
> [303340.532909] RAX: 6b6b6b6b6b6b6b6b RBX: ffff880099159c18 RCX: dead000000200200
> [303340.532909] RDX: 6b6b6b6b6b6b6b6b RSI: ffff880099159c18 RDI: ffff880099159c18
> [303340.532909] RBP: ffff8801f502dbb8 R08: ffff8801a2c1de08 R09: 0000000000000000
> [303340.532909] R10: 0000000000000008 R11: ffffffff81608e11 R12: ffff8800b65ddf00
> [303340.532909] R13: 0000000000000001 R14: 0000000000000001 R15: ffff8800b65ddf00
> [303340.532909] FS:  00007fcac7da5700(0000) GS:ffff8801fe800000(0000) knlGS:0000000000000000
> [303340.532909] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [303340.532909] CR2: ffffffffff600400 CR3: 00000000ae02e000 CR4: 00000000000007e0
> [303340.532909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [303340.532909] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [303340.532909] Process kworker/1:1H (pid: 256, threadinfo ffff8801f502c000, task ffff8801f5029480)
> [303340.532909] Stack:
> [303340.532909]  ffff8801f502dbd8 ffffffff812d4643 ffff8801f502dc28 ffff880099159c18
> [303340.532909]  ffff8801f502dbf8 ffffffffa053876c ffff880099159c18 ffff8801f502dc78
> [303340.532909]  ffff8801f502dc58 ffffffffa053883e ffff8800b65ddf58 00000008f502dc70
> [303340.532909] Call Trace:
> [303340.532909]  [<ffffffff812d4643>] list_del+0x11/0x33
> [303340.532909]  [<ffffffffa053876c>] xfs_ail_delete+0x24/0x3a [xfs]
> [303340.532909]  [<ffffffffa053883e>] xfs_trans_ail_delete_bulk+0xbc/0x149 [xfs]
> [303340.532909]  [<ffffffffa05382a3>] xfs_iflush_done+0x121/0x1be [xfs]
> [303340.532909]  [<ffffffffa0538196>] ? xfs_iflush_done+0x14/0x1be [xfs]
> [303340.532909]  [<ffffffffa0535cb0>] xfs_buf_do_callbacks+0x39/0x4c [xfs]
> [303340.532909]  [<ffffffffa0536707>] xfs_buf_iodone_callbacks+0x2bd/0x2ee [xfs]
> [303340.532909]  [<ffffffffa04d8e9a>] xfs_buf_iodone_work+0x46/0x76 [xfs]
> [303340.532909]  [<ffffffff8107740f>] process_one_work+0x2ff/0x4f5
> [303340.532909]  [<ffffffff81077355>] ? process_one_work+0x245/0x4f5
> [303340.532909]  [<ffffffff810782bf>] ? worker_thread+0x56/0x2d4
> [303340.532909]  [<ffffffff8107847c>] worker_thread+0x213/0x2d4
> [303340.532909]  [<ffffffff81078269>] ? busy_worker_rebind_fn+0x92/0x92
> [303340.532909]  [<ffffffff8107e150>] kthread+0xe1/0xe9
> [303340.532909]  [<ffffffff81563483>] ? _raw_spin_unlock_irq+0x30/0x4e
> [303340.532909]  [<ffffffff8107e06f>] ? __init_kthread_worker+0x6b/0x6b
> [303340.532909]  [<ffffffff8156be7c>] ret_from_fork+0x7c/0xb0
> [303340.532909]  [<ffffffff8107e06f>] ? __init_kthread_worker+0x6b/0x6b
> [303340.532909] Code: de 48 39 c8 75 25 49 89 c8 48 89 f9 48 c7 c2 28 6c 7f 81 be 38 00 00 00 48 c7 c7 72 66 7f 81 b8 00 00 00 00 e8 18 22 d8 ff eb 5c <4c> 8b 00 4c 39 c7 74 22 48 89 f9 48 c7 c2 60 6c 7f 81 be 3b 00
> [303340.532909] RIP  [<ffffffff812d45d4>] __list_del_entry+0x76/0xd4
> [303340.532909]  RSP <ffff8801f502dbb8>
> [303340.848041] ---[ end trace 0d0b800e14608360 ]---


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-06  8:14 ` 3.9.0: " Bernd Schubert
@ 2013-05-06  9:40   ` Bernd Schubert
  2013-05-06 12:28   ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Bernd Schubert @ 2013-05-06  9:40 UTC (permalink / raw)
  Cc: linux-xfs

On 05/06/2013 10:14 AM, Bernd Schubert wrote:
> And anpther protection fault, this time with 3.9.0. Always happens on
> one of the servers. Its ECC memory, so I don't suspect a faulty memory
> bank. Going to fsck now-


xfs_repair didn't find anything.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-06  8:14 ` 3.9.0: " Bernd Schubert
  2013-05-06  9:40   ` Bernd Schubert
@ 2013-05-06 12:28   ` Dave Chinner
  2013-05-06 12:47     ` Bernd Schubert
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2013-05-06 12:28 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-xfs

On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
> And anpther protection fault, this time with 3.9.0. Always happens
> on one of the servers. Its ECC memory, so I don't suspect a faulty
> memory bank. Going to fsck now-

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> 
> >[303340.514052] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> >[303340.517913] Modules linked in: fhgfs(O) fhgfs_client_opentk(O)

Kernel tainted with out of tree modules. Can you reproduce the
problem without them?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-06 12:28   ` Dave Chinner
@ 2013-05-06 12:47     ` Bernd Schubert
  2013-05-07  1:12       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2013-05-06 12:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 05/06/2013 02:28 PM, Dave Chinner wrote:
> On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
>> And anpther protection fault, this time with 3.9.0. Always happens
>> on one of the servers. Its ECC memory, so I don't suspect a faulty
>> memory bank. Going to fsck now-
>
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

Isn't that a bit of overhead? And I can't provide /proc/meminfo and the 
others, as this issue causes a kernel panic a few traces later.

>
>>
>>> [303340.514052] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [303340.517913] Modules linked in: fhgfs(O) fhgfs_client_opentk(O)
>
> Kernel tainted with out of tree modules. Can you reproduce the
> problem without them?

The modules are unused, as this is the server side. I have disabled the 
client packages now and will re-run. But I really think we should look 
for memory/list corruption outside of fhgfs. It is also very unlikely that 
only XFS would always suffer, as ext4 is also running for the fhgfs metadata.
Also, it took from Friday evening until this morning to run into the 
crash, so the next occurrence might take some time. And I think tracing 
XFS is out of the question, as I need the disk space to store data (the 
client side is running our stress test suite).
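
(To make sure the fhgfs modules really stay out of the picture on the next 
run, I'll check something like this before starting the load; just a quick 
sketch:)

    # confirm the out-of-tree modules are not loaded and the kernel is untainted
    lsmod | grep fhgfs || echo "no fhgfs modules loaded"
    cat /proc/sys/kernel/tainted    # 0 means the running kernel is not tainted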


Cheers,
Bernd

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-06 12:47     ` Bernd Schubert
@ 2013-05-07  1:12       ` Dave Chinner
  2013-05-07 11:18         ` Bernd Schubert
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2013-05-07  1:12 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-xfs

On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
> On 05/06/2013 02:28 PM, Dave Chinner wrote:
> >On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
> >>And anpther protection fault, this time with 3.9.0. Always happens
> >>on one of the servers. Its ECC memory, so I don't suspect a faulty
> >>memory bank. Going to fsck now-
> >
> >http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> Isn't that a bit overhead? And I can't provide /proc/meminfo and
> others, as this issue causes a kernel panic a few traces later.

Provide what information you can.  Without knowing a single thing
about your hardware, storage config and workload, I can't help you
at all. You're asking me to find a needle in a haystack blindfolded
and with both hands tied behind my back....

Stuff like /proc/meminfo doesn't have to be provided from exactly
the time of the crash - it's just the simplest way to find out how
much RAM you have in the machine, so a dump from whenever the
machine is up and running the workload is fine. Other information we
ask for (e.g. capturing the output of `vmstat 5` as suggested in the
FAQ) gives us the runtime variation of memory usage and is easy to
capture right up to the failure point....
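
(For example, leaving something like this running in the background until
the machine falls over is usually enough; adjust the log path to taste:)

    # low-overhead capture of memory/IO behaviour right up to the crash
    nohup vmstat 5 >> /var/log/vmstat-5s.log 2>&1 &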

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-07  1:12       ` Dave Chinner
@ 2013-05-07 11:18         ` Bernd Schubert
  2013-05-07 22:07           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2013-05-07 11:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

[-- Attachment #1: Type: text/plain, Size: 2114 bytes --]

On 05/07/2013 03:12 AM, Dave Chinner wrote:
> On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
>> On 05/06/2013 02:28 PM, Dave Chinner wrote:
>>> On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
>>>> And anpther protection fault, this time with 3.9.0. Always happens
>>>> on one of the servers. Its ECC memory, so I don't suspect a faulty
>>>> memory bank. Going to fsck now-
>>>
>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>
>> Isn't that a bit overhead? And I can't provide /proc/meminfo and
>> others, as this issue causes a kernel panic a few traces later.
>
> Provide what information you can.  Without knowing a single thing
> about your hardware, storage config and workload, I can't help you
> at all. You're asking me to find a needle in a haystack blindfolded
> and with both hands tied behind my back....

I see that xfs_info, meminfo, etc. are useful, but /proc/mounts? Maybe 
you want "cat /proc/mounts | grep xfs"? Attached is the output of 
/proc/mounts; please let me know whether you were really interested in 
all of that non-XFS output.

And I just wonder what you are going to do with the information about 
the hardware. It is an Areca HW RAID5 device with 9 disks. But does 
this help? It doesn't tell you whether one of the disks has read/write 
hiccups, and it doesn't provide any performance characteristics at all.
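
(The closest I can get to per-disk information through the controller is 
probably smartmontools' Areca pass-through plus iostat on the exported LUN; 
a sketch, assuming the pass-through works for this controller, that it sits 
at /dev/sg0, and that the LUN is the /dev/sdb from the mounts above:)

    # aggregate latency of the exported RAID5 LUN
    iostat -x /dev/sdb 5
    # per-disk SMART data behind the Areca controller (disk slots 1..9)
    for i in $(seq 1 9); do smartctl -a -d areca,$i /dev/sg0; done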


>
> Stuff like /proc/meminfo doesn't have to be provided from exactly
> the time of the crash - it's just the simplest way to find out how
> much RAM you have in the machine, so a dump from whenever the
> machine is up and running the workload is fine. Other information we
> ask for (e.g. capturing the output of `vmstat 5` as suggested in the
> FAQ) gives us the runtime variation of memory usage and easy to
> capture right up to the failure point...

I have started collectl now; it logs meminfo and other useful 
information. But still, with all of that, are you sure XFS debugging 
information wouldn't be more useful? For example, setting a
"#define debug" in xfs_trans_ail.c?


Cheers,
Bernd





[-- Attachment #2: mounts.txt --]
[-- Type: text/plain, Size: 3583 bytes --]

rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=3482172k,nr_inodes=870543,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,relatime,size=1406060k,mode=755 0 0
192.168.40.150:/chroots/squeeze64 / nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,port=2049,timeo=7,retrans=10,sec=sys,local_lock=all,addr=192.168.40.150 0 0
tmpfs /tmp tmpfs rw,relatime 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0
172.18.25.3://scratch/unionfs/groups/squeeze /unionfs/group nfs rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,nolock,proto=tcp,port=2049,timeo=600,retrans=2,sec=sys,mountaddr=172.18.25.3,mountvers=3,mountport=52204,mountproto=tcp,local_lock=all,addr=172.18.25.3 0 0
172.18.25.3://scratch/unionfs/hosts/192.168.40.112 /unionfs/host nfs rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,nolock,proto=tcp,port=2049,timeo=600,retrans=2,sec=sys,mountaddr=172.18.25.3,mountvers=3,mountport=52204,mountproto=tcp,local_lock=all,addr=172.18.25.3 0 0
192.168.40.150:/chroots/squeeze64/root /unionfs/common/root nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,port=2049,timeo=7,retrans=10,sec=sys,local_lock=all,addr=192.168.40.150 0 0
unionfs-fuse /unionfs/union/root fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
unionfs-fuse /root fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
192.168.40.150:/chroots/squeeze64/etc /unionfs/common/etc nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,port=2049,timeo=7,retrans=10,sec=sys,local_lock=all,addr=192.168.40.150 0 0
unionfs-fuse /unionfs/union/etc fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
unionfs-fuse /etc fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
192.168.40.150:/chroots/squeeze64/var /unionfs/common/var nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,port=2049,timeo=7,retrans=10,sec=sys,local_lock=all,addr=192.168.40.150 0 0
unionfs-fuse /unionfs/union/var fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
unionfs-fuse /var fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
192.168.40.150:/chroots/squeeze64/opt /unionfs/common/opt nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,proto=tcp,port=2049,timeo=7,retrans=10,sec=sys,local_lock=all,addr=192.168.40.150 0 0
unionfs-fuse /unionfs/union/opt fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
unionfs-fuse /opt fuse.unionfs-fuse rw,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
/dev/sdc /data/fhgfs/meta ext4 rw,relatime,journal_checksum,journal_async_commit,nobarrier,data=writeback 0 0
/dev/sdb /data/fhgfs/storage1 xfs rw,relatime,attr2,inode64,logbsize=128k,sunit=256,swidth=2048,noquota 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
nfsd /proc/fs/nfsd nfsd rw,relatime 0 0
fsdevel3:/home/schubert/src /home/schubert/src fuse.sshfs rw,nosuid,nodev,relatime,user_id=5741,group_id=2130,allow_other,max_read=65536 0 0


[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-07 11:18         ` Bernd Schubert
@ 2013-05-07 22:07           ` Dave Chinner
  2013-05-08 17:48             ` Bernd Schubert
  2013-05-09  7:16             ` Stan Hoeppner
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2013-05-07 22:07 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-xfs

On Tue, May 07, 2013 at 01:18:13PM +0200, Bernd Schubert wrote:
> On 05/07/2013 03:12 AM, Dave Chinner wrote:
> >On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
> >>On 05/06/2013 02:28 PM, Dave Chinner wrote:
> >>>On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
> >>>>And anpther protection fault, this time with 3.9.0. Always happens
> >>>>on one of the servers. Its ECC memory, so I don't suspect a faulty
> >>>>memory bank. Going to fsck now-
> >>>
> >>>http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >>
> >>Isn't that a bit overhead? And I can't provide /proc/meminfo and
> >>others, as this issue causes a kernel panic a few traces later.
> >
> >Provide what information you can.  Without knowing a single thing
> >about your hardware, storage config and workload, I can't help you
> >at all. You're asking me to find a needle in a haystack blindfolded
> >and with both hands tied behind my back....
> 
> I see that xfs_info, meminfo, etc are useful, but /proc/mounts?
> Maybe you want "cat /proc/mounts | grep xfs"?. Attached is the
> output of /proc/mounts, please let me know if you were really
> interested in all of that non-xfs output?

Yes. You never know what is relevant to a problem that is reported,
especially if there are multiple filesystems sharing the same
device...

> And I just wonder what you are going to do with the information
> about the hardware. So it is an Areca hw-raid5 device with 9 disks.
> But does this help? It doesn't tell if one of the disks reads/writes
> with hickups or provides any performance characteristics at all.

Yes, it does, because Areca cards are by far the most unreliable HW
RAID you can buy, which is not surprising because they are also the
cheapest. This is from experience: we see reports of filesystems
being badly corrupted every few months because of problems with Areca
controllers.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-07 22:07           ` Dave Chinner
@ 2013-05-08 17:48             ` Bernd Schubert
  2013-05-09  0:41               ` Dave Chinner
  2013-05-09  7:16             ` Stan Hoeppner
  1 sibling, 1 reply; 15+ messages in thread
From: Bernd Schubert @ 2013-05-08 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 05/08/2013 12:07 AM, Dave Chinner wrote:
> On Tue, May 07, 2013 at 01:18:13PM +0200, Bernd Schubert wrote:
>> On 05/07/2013 03:12 AM, Dave Chinner wrote:
>>> On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
>>>> On 05/06/2013 02:28 PM, Dave Chinner wrote:
>>>>> On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
>>>>>> And anpther protection fault, this time with 3.9.0. Always happens
>>>>>> on one of the servers. Its ECC memory, so I don't suspect a faulty
>>>>>> memory bank. Going to fsck now-
>>>>>
>>>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>>>
>>>> Isn't that a bit overhead? And I can't provide /proc/meminfo and
>>>> others, as this issue causes a kernel panic a few traces later.
>>>
>>> Provide what information you can.  Without knowing a single thing
>>> about your hardware, storage config and workload, I can't help you
>>> at all. You're asking me to find a needle in a haystack blindfolded
>>> and with both hands tied behind my back....
>>
>> I see that xfs_info, meminfo, etc are useful, but /proc/mounts?
>> Maybe you want "cat /proc/mounts | grep xfs"?. Attached is the
>> output of /proc/mounts, please let me know if you were really
>> interested in all of that non-xfs output?
>
> Yes. You never know what is relevant to a problem that is reported,
> especially if there are multiple filesystems sharing the same
> device...

Hmm, I see. But then you need to extend your questions to multipathing and 
shared storage. In both cases you can easily get double mounts... I should 
probably try to find some time to add ext4's MMP to XFS.

>
>> And I just wonder what you are going to do with the information
>> about the hardware. So it is an Areca hw-raid5 device with 9 disks.
>> But does this help? It doesn't tell if one of the disks reads/writes
>> with hickups or provides any performance characteristics at all.
>
> Yes, it does, because Areca cards are by far the most unreliable HW
> RAID you can buy, which is not surprising because they are also the

Ahem. Compared to other hardware RAIDs, Areca is very stable.

> cheapest. This is through experience - we see reports of filesystems
> being badly corrupted ever few months because of problems with Areca
> controllers.

The problem is that naming the hardware controller does not tell you 
anything about the disks. And most RAID solutions do not care at all about 
disk corruption; that is getting better with T10 DIF/DIX, but unfortunately 
I still don't see it used in most installations.
As I have been aware of that problem for years, we started to write 
ql-fstest [1] several years ago, which checks for data corruption. It 
is also part of our stress test suite, and so far it hasn't reported 
anything. So we can exclude disk/controller data corruption with very 
high probability.
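
(The core of ql-fstest is nothing more exotic than write-a-known-pattern, 
force a re-read from disk, and compare; a minimal sketch of the idea, not 
its actual interface:)

    # write a pattern file, drop caches, and verify it against the original
    dd if=/dev/urandom of=/tmp/pattern.bin bs=1M count=512
    cp /tmp/pattern.bin /data/fhgfs/storage1/pattern.bin
    sync
    echo 3 > /proc/sys/vm/drop_caches     # make sure the re-read hits the disks
    cmp /tmp/pattern.bin /data/fhgfs/storage1/pattern.bin \
        && echo "no corruption detected" || echo "CORRUPTION"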

You might want to add something like this to your FAQ:

Q: Are you sure there is no disk / controller / memory data corruption? 
If so, please state why!



Cheers,
Bernd


[1] https://bitbucket.org/aakef/ql-fstest






_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-08 17:48             ` Bernd Schubert
@ 2013-05-09  0:41               ` Dave Chinner
  2013-05-10 10:19                 ` Bernd Schubert
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2013-05-09  0:41 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-xfs

On Wed, May 08, 2013 at 07:48:04PM +0200, Bernd Schubert wrote:
> On 05/08/2013 12:07 AM, Dave Chinner wrote:
> >On Tue, May 07, 2013 at 01:18:13PM +0200, Bernd Schubert wrote:
> >>On 05/07/2013 03:12 AM, Dave Chinner wrote:
> >>>On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
> >>>>On 05/06/2013 02:28 PM, Dave Chinner wrote:
> >>>>>On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
> >>>>>>And anpther protection fault, this time with 3.9.0. Always happens
> >>>>>>on one of the servers. Its ECC memory, so I don't suspect a faulty
> >>>>>>memory bank. Going to fsck now-
> >>>>>
> >>>>>http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >>>>
> >>>>Isn't that a bit overhead? And I can't provide /proc/meminfo and
> >>>>others, as this issue causes a kernel panic a few traces later.
> >>>
> >>>Provide what information you can.  Without knowing a single thing
> >>>about your hardware, storage config and workload, I can't help you
> >>>at all. You're asking me to find a needle in a haystack blindfolded
> >>>and with both hands tied behind my back....
> >>
> >>I see that xfs_info, meminfo, etc are useful, but /proc/mounts?
> >>Maybe you want "cat /proc/mounts | grep xfs"?. Attached is the
> >>output of /proc/mounts, please let me know if you were really
> >>interested in all of that non-xfs output?
> >
> >Yes. You never know what is relevant to a problem that is reported,
> >especially if there are multiple filesystems sharing the same
> >device...
> 
> Hmm, I see. But you need to extend your questions to multipathing
> and shared storage.

Why would we? Anyone using such a configuration who reports a bug is
usually clueful enough to mention it when describing their RAID/LVM
setup.  The FAQ entry covers the basic information needed to start
meaningful triage, not *all* the information we might ask for. It's
the baseline we start from.

Indeed, the FAQ exists because I got sick of asking people for the
same information several times a week, every week, in response to
poor bug reports like yours. It's far more efficient to paste a link
several times a week.  I.e. the FAQ entry is there for my benefit,
not yours.

I don't really care if you don't understand why we are asking for
that information, I simply expect you to provide it as best you can
if you want your problem solved.

> Both time you can easily get double mounts... I
> probably should try to find some time to add ext4s MMP to XFS.

Doesn't solve the problem. It doesn't prevent multiple write access
to the lun:

	Ah, a free lun. I'll just put LVM on it and mkfs it and....
	Oh, sorry, were you using that lun?

So, naive hacks like MMP don't belong in filesystems....

> >>And I just wonder what you are going to do with the information
> >>about the hardware. So it is an Areca hw-raid5 device with 9 disks.
> >>But does this help? It doesn't tell if one of the disks reads/writes
> >>with hickups or provides any performance characteristics at all.
> >
> >Yes, it does, because Areca cards are by far the most unreliable HW
> >RAID you can buy, which is not surprising because they are also the
> 
> Ahem. Compared to other hardware raids Areca is very stable.

Maybe in your experience. We get a report every 3-4 months about
Areca hardware causing catastrophic data loss. It outnumbers every
other type of hardware RAID by at least 10:1 when it comes to such
problem reports.

> You might want to add to your FAQ something like:
> 
> Q: Are you sure there is not disk / controller / memory data
> corruption? If so please state why!

No, the FAQ entry is for gathering facts and data, not what
the bug reporter *thinks* might be the problem. If there's
corruption we'll see it in the information that is gathered, and
then we can start to look for the source.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-07 22:07           ` Dave Chinner
  2013-05-08 17:48             ` Bernd Schubert
@ 2013-05-09  7:16             ` Stan Hoeppner
  1 sibling, 0 replies; 15+ messages in thread
From: Stan Hoeppner @ 2013-05-09  7:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Bernd Schubert, linux-xfs

On 5/7/2013 5:07 PM, Dave Chinner wrote:

>> And I just wonder what you are going to do with the information
>> about the hardware. So it is an Areca hw-raid5 device with 9 disks.
>> But does this help? It doesn't tell if one of the disks reads/writes
>> with hickups or provides any performance characteristics at all.
> 
> Yes, it does, because Areca cards are by far the most unreliable HW
> RAID you can buy, which is not surprising because they are also the
> cheapest. This is through experience - we see reports of filesystems
> being badly corrupted ever few months because of problems with Areca
> controllers.

And the sad part is that they're not that much lower priced than a
comparable LSI card, at least in N. America.  Newegg sells a 28-port LSI
for $1500 and a 28-port Areca for $1300, a paltry 13% difference.  Areca
packs 8x more DRAM onto this board (4GB vs 512MB) via a standard DIMM
socket, and touts the larger RAM capacity and expandability as a big
performance booster.  AIUI this is only partially true.  The larger
capacity for the most part simply helps their weak firmware keep pace
with some workloads, mainly large streaming, but the random IO
performance of the Arecas isn't all that great, and regardless of the
size of the DIMM one inserts, random performance doesn't change much.

Regarding reliability, it's interesting to note that the RAID card
industry as a whole began moving away from standard socketed DRAM quite
some time ago.  When a manufacturer solders DRAM chips to the board they
have direct control over memory quality and the testing/verification
process of the finished product.  So when the customer installs and uses
the board there are no surprises.  With standard DIMM-socketed
boards the customer can insert any DIMM s/he wishes, and there's no
guarantee of the quality/reliability of the DIMM, nor of the complete unit.
 AFAIK Areca has the only line of RAID cards on the market with a socket
for standard DRAM.  HP uses a socket design but the daughterboard holds
more than just DRAM, and you must use HP's daughterboard, thus they
control quality.  I wonder how many of the people who have problems with
their Areca board had inserted aftermarket DIMMs, vs those using factory
memory who simply ran into firmware or board QC problems.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-09  0:41               ` Dave Chinner
@ 2013-05-10 10:19                 ` Bernd Schubert
  2013-05-10 13:33                   ` Eric Sandeen
  2013-05-11  0:12                   ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Bernd Schubert @ 2013-05-10 10:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 05/09/2013 02:41 AM, Dave Chinner wrote:
> On Wed, May 08, 2013 at 07:48:04PM +0200, Bernd Schubert wrote:
>> On 05/08/2013 12:07 AM, Dave Chinner wrote:
>>> On Tue, May 07, 2013 at 01:18:13PM +0200, Bernd Schubert wrote:
>>>> On 05/07/2013 03:12 AM, Dave Chinner wrote:
>>>>> On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
>>>>>> On 05/06/2013 02:28 PM, Dave Chinner wrote:
>>>>>>> On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
>>>>>>>> And anpther protection fault, this time with 3.9.0. Always happens
>>>>>>>> on one of the servers. Its ECC memory, so I don't suspect a faulty
>>>>>>>> memory bank. Going to fsck now-
>>>>>>>
>>>>>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>>>>>
>>>>>> Isn't that a bit overhead? And I can't provide /proc/meminfo and
>>>>>> others, as this issue causes a kernel panic a few traces later.
>>>>>
>>>>> Provide what information you can.  Without knowing a single thing
>>>>> about your hardware, storage config and workload, I can't help you
>>>>> at all. You're asking me to find a needle in a haystack blindfolded
>>>>> and with both hands tied behind my back....
>>>>
>>>> I see that xfs_info, meminfo, etc are useful, but /proc/mounts?
>>>> Maybe you want "cat /proc/mounts | grep xfs"?. Attached is the
>>>> output of /proc/mounts, please let me know if you were really
>>>> interested in all of that non-xfs output?
>>>
>>> Yes. You never know what is relevant to a problem that is reported,
>>> especially if there are multiple filesystems sharing the same
>>> device...
>>
>> Hmm, I see. But you need to extend your questions to multipathing
>> and shared storage.
>
> why would we? Anyone using such a configuration reporting a bug
> usually is clueful enough to mention it in their bug report when
> describing their RAID/LVM setup.  The FAQ entry covers the basic
> information needed to start meaingful triage, not *all* the
> infomration we might ask for. It's the baseline we start from.
>
> Indeed, the FAQ exists because I got sick of asking people for the
> same information several times a week, every week in response to
> poor bug reports like yours. it's far more efficient to paste a link
> several times a week.  i.e. The FAQ entry is there for my benefit,
> not yours.

Poor bug report or not, most of the information you ask for in the FAQ 
is entirely irrelevant to this issue.

>
> I don't really care if you don't understand why we are asking for
> that information, I simply expect you to provide it as best you can
> if you want your problem solved.

And here we go: the bug I reported is not my problem. I simply reported 
a bug in XFS. You can use the report or not; I do not care at all. This is 
not a production system, and if XFS does not run sufficiently stably I am 
simply going to switch to another file system.
Of course I don't like bugs and I'm going to help fix this one, but I have 
a long daily todo list and I'm not going to spend my time filling in 
irrelevant items.

>
>> Both time you can easily get double mounts... I
>> probably should try to find some time to add ext4s MMP to XFS.
>
> Doesn't solve the problem. It doesn't prevent multiple write access
> to the lun:
>
> 	Ah, a free lun. I'll just put LVM on it and mkfs it and....
> 	Oh, sorry, were you using that lun?

MMP is not about human mistakes. MMP is an *additional* protection for 
software-managed shared storage devices. If your HA software runs into a 
bug or ends up split-brained for some reason, you easily get a double mount. 
The fact that MMP also protects against a few human errors is just a nice 
add-on.
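
(On ext4 it is just a feature flag; for reference, roughly this, assuming 
a new enough e2fsprogs and kernel:)

    # turn on multi-mount protection for the ext4 metadata device used here
    tune2fs -O mmp /dev/sdc
    dumpe2fs -h /dev/sdc | grep -i mmp    # shows the MMP block and update interval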

>
> So, naive hacks like MMP don't belong in filesystems....

Maybe there are better solutions, but it works fine as it is.

>
>>>> And I just wonder what you are going to do with the information
>>>> about the hardware. So it is an Areca hw-raid5 device with 9 disks.
>>>> But does this help? It doesn't tell if one of the disks reads/writes
>>>> with hickups or provides any performance characteristics at all.
>>>
>>> Yes, it does, because Areca cards are by far the most unreliable HW
>>> RAID you can buy, which is not surprising because they are also the
>>
>> Ahem. Compared to other hardware raids Areca is very stable.
>
> Maybe in your experience. We get a report every 3-4 months about
> Areca hardware causing catastrophic data loss. It outnumbers every
> other type of hardware RAID by at least 10:1 when it comes to such
> problem reports.

The number of your reports simply correlates with the number of installed 
Areca controllers. The vendor I'm talking about only has externally 
connected boxes and isn't used nearly as much as Areca. And don't get me 
wrong, I don't want to defend Areca at all. Personally I don't like any 
of these cheap RAID solutions, for several reasons (e.g. no disk latency 
stats, no parity verification, etc.).

>
>> You might want to add to your FAQ something like:
>>
>> Q: Are you sure there is not disk / controller / memory data
>> corruption? If so please state why!
>
> No, the FAQ entry is for gathering facts and data, not what
> the bug reporter *thinks* might be the problem. If there's
> corruption we'll see it in the information that is gathered, and
> then we can start to look for the source.

You *might* see it in the information that is gathered. But without 
additional checksums that you write yourself, you can never be sure. 
Metadata CRCs, as you have implemented them, will certainly help.


Cheers,
Bernd

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-10 10:19                 ` Bernd Schubert
@ 2013-05-10 13:33                   ` Eric Sandeen
  2013-05-11  0:12                   ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Eric Sandeen @ 2013-05-10 13:33 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-xfs

On 5/10/13 5:19 AM, Bernd Schubert wrote:
> On 05/09/2013 02:41 AM, Dave Chinner wrote:
>> On Wed, May 08, 2013 at 07:48:04PM +0200, Bernd Schubert wrote:
>>> On 05/08/2013 12:07 AM, Dave Chinner wrote:
>>>> On Tue, May 07, 2013 at 01:18:13PM +0200, Bernd Schubert wrote:
>>>>> On 05/07/2013 03:12 AM, Dave Chinner wrote:
>>>>>> On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
>>>>>>> On 05/06/2013 02:28 PM, Dave Chinner wrote:
>>>>>>>> On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
>>>>>>>>> And anpther protection fault, this time with 3.9.0. Always happens
>>>>>>>>> on one of the servers. Its ECC memory, so I don't suspect a faulty
>>>>>>>>> memory bank. Going to fsck now-
>>>>>>>>
>>>>>>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>>>>>>
>>>>>>> Isn't that a bit overhead? And I can't provide /proc/meminfo and
>>>>>>> others, as this issue causes a kernel panic a few traces later.
>>>>>>
>>>>>> Provide what information you can.  Without knowing a single thing
>>>>>> about your hardware, storage config and workload, I can't help you
>>>>>> at all. You're asking me to find a needle in a haystack blindfolded
>>>>>> and with both hands tied behind my back....
>>>>>
>>>>> I see that xfs_info, meminfo, etc are useful, but /proc/mounts?
>>>>> Maybe you want "cat /proc/mounts | grep xfs"?. Attached is the
>>>>> output of /proc/mounts, please let me know if you were really
>>>>> interested in all of that non-xfs output?
>>>>
>>>> Yes. You never know what is relevant to a problem that is reported,
>>>> especially if there are multiple filesystems sharing the same
>>>> device...
>>>
>>> Hmm, I see. But you need to extend your questions to multipathing
>>> and shared storage.

If you'd like to add that to the wiki, that would be great.

>> why would we? Anyone using such a configuration reporting a bug
>> usually is clueful enough to mention it in their bug report when
>> describing their RAID/LVM setup.  The FAQ entry covers the basic
>> information needed to start meaingful triage, not *all* the
>> infomration we might ask for. It's the baseline we start from.
>>
>> Indeed, the FAQ exists because I got sick of asking people for the
>> same information several times a week, every week in response to
>> poor bug reports like yours. it's far more efficient to paste a link
>> several times a week.  i.e. The FAQ entry is there for my benefit,
>> not yours.
> 
> Poor bug report or not, most information you ask about in the FAQ are entirely irrelevant for this issue.

If I had a dollar for every time a bug reporter left out "irrelevant"
information that turned out to be critical, I might be retired by now.  :)

If a few developers on the list are going to scale to supporting every
user with a problem, we need to share the effort efficiently, and that
means putting just a bit more burden on the reporter, to cut down
on the back and forth cycles of trying to gather more information.

If anyone wants a quick & useful project, perhaps a script which gathers
all the info requested in the FAQ would be a step in the right direction.
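
Untested, but something along these lines would cover most of the FAQ
items (not an official tool, just a sketch):

    #!/bin/sh
    # gather the basics the XFS FAQ asks for into one report file
    out=xfs-report-$(hostname)-$(date +%Y%m%d).txt
    {
      echo "--- kernel ---";      uname -a
      echo "--- xfsprogs ---";    xfs_repair -V
      echo "--- cpus ---";        grep -c ^processor /proc/cpuinfo
      echo "--- meminfo ---";     cat /proc/meminfo
      echo "--- mounts ---";      cat /proc/mounts
      echo "--- partitions ---";  cat /proc/partitions
      for m in $(awk '$3 == "xfs" { print $2 }' /proc/mounts); do
          echo "--- xfs_info $m ---"; xfs_info "$m"
      done
      echo "--- dmesg ---";       dmesg
    } > "$out" 2>&1
    echo "wrote $out"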

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-10 10:19                 ` Bernd Schubert
  2013-05-10 13:33                   ` Eric Sandeen
@ 2013-05-11  0:12                   ` Dave Chinner
  2013-06-03 16:39                     ` Bernd Schubert
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2013-05-11  0:12 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-xfs

On Fri, May 10, 2013 at 12:19:21PM +0200, Bernd Schubert wrote:
> On 05/09/2013 02:41 AM, Dave Chinner wrote:
> >On Wed, May 08, 2013 at 07:48:04PM +0200, Bernd Schubert wrote:
> >>On 05/08/2013 12:07 AM, Dave Chinner wrote:
> >>>On Tue, May 07, 2013 at 01:18:13PM +0200, Bernd Schubert wrote:
> >>>>On 05/07/2013 03:12 AM, Dave Chinner wrote:
> >>>>>On Mon, May 06, 2013 at 02:47:31PM +0200, Bernd Schubert wrote:
> >>>>>>On 05/06/2013 02:28 PM, Dave Chinner wrote:
> >>>>>>>On Mon, May 06, 2013 at 10:14:22AM +0200, Bernd Schubert wrote:
> >>>>>>>>And anpther protection fault, this time with 3.9.0. Always happens
> >>>>>>>>on one of the servers. Its ECC memory, so I don't suspect a faulty
> >>>>>>>>memory bank. Going to fsck now-
> >>>>>>>
> >>>>>>>http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >>>>>>
> >>>>>>Isn't that a bit overhead? And I can't provide /proc/meminfo and
> >>>>>>others, as this issue causes a kernel panic a few traces later.
> >>>>>
> >>>>>Provide what information you can.  Without knowing a single thing
> >>>>>about your hardware, storage config and workload, I can't help you
> >>>>>at all. You're asking me to find a needle in a haystack blindfolded
> >>>>>and with both hands tied behind my back....
> >>>>
> >>>>I see that xfs_info, meminfo, etc are useful, but /proc/mounts?
> >>>>Maybe you want "cat /proc/mounts | grep xfs"?. Attached is the
> >>>>output of /proc/mounts, please let me know if you were really
> >>>>interested in all of that non-xfs output?
> >>>
> >>>Yes. You never know what is relevant to a problem that is reported,
> >>>especially if there are multiple filesystems sharing the same
> >>>device...
> >>
> >>Hmm, I see. But you need to extend your questions to multipathing
> >>and shared storage.
> >
> >why would we? Anyone using such a configuration reporting a bug
> >usually is clueful enough to mention it in their bug report when
> >describing their RAID/LVM setup.  The FAQ entry covers the basic
> >information needed to start meaingful triage, not *all* the
> >infomration we might ask for. It's the baseline we start from.
> >
> >Indeed, the FAQ exists because I got sick of asking people for the
> >same information several times a week, every week in response to
> >poor bug reports like yours. it's far more efficient to paste a link
> >several times a week.  i.e. The FAQ entry is there for my benefit,
> >not yours.
> 
> Poor bug report or not, most information you ask about in the FAQ
> are entirely irrelevant for this issue.

<sigh>

You're complaining that I've asked for irrelevant information in
response to your bug report. I know nothing about your system, so I
need to know some basic information before I start. So while you
might think it's irrelevant, it is critical information for me.

> >I don't really care if you don't understand why we are asking for
> >that information, I simply expect you to provide it as best you can
> >if you want your problem solved.
> 
> And here we go, the bug I reported is not my problem. I simply
> reported a bug in XFS. You can use it or not, I do not care at all.

No, that's not what I said. You're hearing what you want to hear,
not what I'm saying.  I care about fixing the bug - I'd be ignoring
you if I didn't care.

However: don't confuse that with the fact that I don't care who you
are, what you do, how important you think you are or what you think
you know. Who you are is simply not important - you are a Random Joe
from the interwebs who has reported a bug, and I've given you the same
response that I gave to the last hundred Random Joes that have
reported problems.

So, Random Joe, it's now your responsibility to jump through the
hoops we ask you to so that _we_ can find the cause of your problem.

> Of course I don't like bugs and I'm going to help to fix it, but I
> have a long daily todo list and I'm not going to spend my time
> filling in irrelevent items.

Oh, cry me a river. The majority of what is asked for in the FAQ can
be gathered in less than 5 minutes. You've wasted far more time than
that arguing that what I asked for is irrelevant, unnecessary and
too hard to gather, and lecturing about irrelevant stuff like MMP and
HA.  Then you had the gall to accuse me of not caring about fixing
your bug.  You need to pull your head in and take a long, hard look
at yourself.

I can't fix a bug without the bug reporter's help, and you seem to
be unwilling to help. Ergo, I can't fix the bug. Co-operation is
needed. If you want your problem fixed, please drop the attitude, go
back up the thread to where I asked for information from you, and
start fresh.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: 3.9.0: general protection fault
  2013-05-11  0:12                   ` Dave Chinner
@ 2013-06-03 16:39                     ` Bernd Schubert
  0 siblings, 0 replies; 15+ messages in thread
From: Bernd Schubert @ 2013-06-03 16:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Just an update here: the issue came up only once more, on another system, 
and for a couple of days I didn't have time to even save the collectl 
log file. I then increased the maximum number of rotated log files, but it 
hasn't happened again since. So I don't have any logs about the state 
of the system at crash time so far.
However, I just captured a file corruption:

> (squeeze)fslab3:~/fstests# cat /mnt/fhgfs//fslab4/ql-fstest/fstest13579.err
> File corruption in /mnt/fhgfs//fslab4/ql-fstest/fstest.13635/d040/d030/7ae214d1 (create time: Mon Jun  3 17:36:11 2013) around 246415360 [pattern = 7ae214d1]
> After n-checks: 3
> Expected: d1, got: 83 (pos = 247324600)
> Expected: 14, got: ec (pos = 247324601)
> Expected: e2, got: 30 (pos = 247324602)
> Expected: 7a, got: 48 (pos = 247324603)
> Expected: d1, got: 89 (pos = 247324604)
> Expected: 14, got: 5d (pos = 247324605)
> Expected: e2, got: e8 (pos = 247324606)
...
> Expected: 14, got: 84 (pos = 247324661)
> Expected: e2, got: b9 (pos = 247324662)
> Expected: 7a, got: 0 (pos = 247324663)

Hmm, exactly 64 bytes of corrupted data; the file itself has a size of 
512 MiB.

I'm going to export single disks from the controller and use them with 
md RAID6, as that allows parity checks and identifying bad disks.
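
(That gives me md's own consistency check as well; roughly, assuming the
new array ends up as /dev/md0:)

    # trigger a full parity check of the md RAID6 array and look at the result
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt    # non-zero after the check means inconsistent stripes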


Cheers,
Bernd

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-06-03 16:40 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-02 14:45 3.8.7: general protection fault Bernd Schubert
2013-05-06  8:14 ` 3.9.0: " Bernd Schubert
2013-05-06  9:40   ` Bernd Schubert
2013-05-06 12:28   ` Dave Chinner
2013-05-06 12:47     ` Bernd Schubert
2013-05-07  1:12       ` Dave Chinner
2013-05-07 11:18         ` Bernd Schubert
2013-05-07 22:07           ` Dave Chinner
2013-05-08 17:48             ` Bernd Schubert
2013-05-09  0:41               ` Dave Chinner
2013-05-10 10:19                 ` Bernd Schubert
2013-05-10 13:33                   ` Eric Sandeen
2013-05-11  0:12                   ` Dave Chinner
2013-06-03 16:39                     ` Bernd Schubert
2013-05-09  7:16             ` Stan Hoeppner
