All of lore.kernel.org
 help / color / mirror / Atom feed
* 3.0.3 64-bit Crash running fscache/cachefilesd
@ 2011-08-25 16:44 Mark Moseley
  2011-08-26 12:52 ` [Linux-cachefs] " Дмитрий Ильин
  2011-09-01 13:04 ` David Howells
  0 siblings, 2 replies; 46+ messages in thread
From: Mark Moseley @ 2011-08-25 16:44 UTC (permalink / raw)
  To: Linux filesystem caching discussion list, linux-kernel

I get this after a handful of hours. It's not terribly deterministic
when it's going to melt down, but typically doesn't last more than a
few hours before panicking.

This is 3.0.3, 64-bit, running Debian Squeeze, running on a usually
stable Dell PE 1950. I'm happy to run any sort of traces or send it
whatever would be useful in debugging (.config, etc). Output is over
IPMI, so it's a tad scrambled, but I didn't want to mess with it for
fear of obscuring something important. Environment is heavy NFS-backed
web hosting. Backing device that the fscache cache is on is an SSD,
but I've seen the same thing on a regular drive. The filesystem for
the fscache cache in the below example is EXT4, but I've seen the same
thing on XFS.

I should mention too that there's nothing special about the 3.0.3
crash. I get similar crashes with 2.6.39.4 and any previous kernel
I've tested. 3.0.3 is just the most recent one I've tested.


[25625.932971] ------------[ cut here ]------------
[25625.942202] kernel BUG at fs/cachefiles/namei.c:166!
[25625.942874] invalid opcode: 0000 [#1] SMP
[25625.942874] CPU 6
[25625.942874] Modules linked in: xfs ioatdma dca loop joydev fan
evdev i5000_edac edac_core psmouse i5k_amb dcdbas serio_raw shpchp
pcspkr pci_hotplug ]
[25625.942874]
[25625.942874] Pid: 23795, comm: kworker/u:5 Not tainted 3.0.3 #1 Dell
Inc. PowerEdge 1950/0DT097
[25625.942874] RIP: 0010:[<ffffffff81299cf3>]  [<ffffffff81299cf3>]
cachefiles_walk_to_object+0xcb3/0xdd0
[25625.942874] RSP: 0018:ffff8801ab84dc60  EFLAGS: 00010282
[25625.942874] RAX: ffff88003935e601 RBX: ffff8801d8cff330 RCX: 000000000047bea6
[25625.942874] RDX: 000000000047bea5 RSI: 0000000000010200 RDI: ffff88022ec02780
[25625.942874] RBP: ffff8801ab84dd50 R08: 000000000047bea5 R09: ffffea0000c83c20
[25625.942874] R10: ffffffff812982aa R11: 0000000000000003 R12: ffff8801d8cff200
[25625.942874] R13: ffff8801a4a06300 R14: ffff880224ffa780 R15: ffff8801c0dddf00
[25625.942874] FS:  0000000000000000(0000) GS:ffff88022fd80000(0000)
knlGS:0000000000000000
[25625.942874] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[25625.942874] CR2: ffffffffff600400 CR3: 00000000016a2000 CR4: 00000000000006f0
[25625.942874] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[25625.942874] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[25625.942874] Process kworker/u:5 (pid: 23795, threadinfo
ffff880082bc6300, task ffff880082bc5e00)
[25625.942874] Stack:
[25625.942874]  0000000000000003 0000000000000000 ffff8801ab84dc90
ffff880082bc5e00
[25625.942874]  ffff880082bc6228 ffff880082bc6228 ffff880082bc6228
ffff8801ab84dd08
[25625.942874]  ffff880082bc5e00 ffff88022eee5310 ffff880104639400
ffff8801f0e5f664
[25625.942874] Call Trace:
[25625.942874]  [<ffffffff81074010>] ? wake_up_bit+0x40/0x40
[25625.942874]  [<ffffffff812973ab>] cachefiles_lookup_object+0x5b/0x170
[25625.942874]  [<ffffffff811ad864>] fscache_lookup_object+0xd4/0x2b0
[25625.942874]  [<ffffffff811ae789>] fscache_object_work_func+0x4f9/0xd60
[25625.942874]  [<ffffffff8106c594>] process_one_work+0x164/0x450
[25625.942874]  [<ffffffff811ae290>] ? fscache_enqueue_dependents+0x120/0x120
[25625.942874]  [<ffffffff8106cc2b>] worker_thread+0x19b/0x430
[25625.942874]  [<ffffffff8106ca90>] ? manage_workers+0x210/0x210
[25625.942874]  [<ffffffff81073abe>] kthread+0x9e/0xb0
[25625.942874]  [<ffffffff81671194>] kernel_thread_helper+0x4/0x10
[25625.942874]  [<ffffffff8166866d>] ? retint_restore_args+0x13/0x13
[25625.942874]  [<ffffffff81073a20>] ? kthread_worker_fn+0x1a0/0x1a0
[25625.942874]  [<ffffffff81671190>] ? gs_change+0xb/0xb
[25625.942874] Code: 00 48 c7 c7 78 6d 90 81 31 c0 e8 92 b0 3c 00 0f
0b eb fe 48 c7 c7 78 7b 90 81 31 c0 e8 80 b0 3c 00 31 f6 4c 89 f7 e8
3d e5 ff ff <0
[25625.942874] RIP  [<ffffffff81299cf3>] cachefiles_walk_to_object+0xcb3/0xdd0
[25625.942874]  RSP <ffff8801ab84dc60>
2011 Aug 25 07:01:04 boscust2102[25626.490246] ---[ end trace
abce6c7388af252a ]---
 [25625.932971] ------------[ cu[25626.505216] Kernel panic - not
syncing: Fatal exception
t here ]--------[25626.520310] Pid: 23795, comm: kworker/u:5 Tainted:
G      D     3.0.3 #1
----
    2011 Aug 25[25626.534651] Call Trace:
 07:01:04 boscus[25626.542237]  [<ffffffff81664c4e>] panic+0xbf/0x1da
t2102 [25625.942[25626.554578]  [<ffffffff8104ef9f>] ? kmsg_dump+0x4f/0x100
874] invalid opc[25626.567722]  [<ffffffff81669655>] oops_end+0xa5/0xf0
ode: 0000 [#1] S[25626.580262]  [<ffffffff810058db>] die+0x5b/0x90
MP
   [25626.592190]  [<ffffffff81669170>] do_trap+0x190/0x1a0
[25626.602854]  [<ffffffff8166bf2a>] ? atomic_notifier_call_chain+0x1a/0x20
[25626.616517]  [<ffffffff810034f5>] do_invalid_op+0x95/0xb0
[25626.627565]  [<ffffffff81299cf3>] ? cachefiles_walk_to_object+0xcb3/0xdd0
[25626.641457]  [<ffffffff812febfa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[25626.654860]  [<ffffffff812982aa>] ? cachefiles_printk_object+0x7a/0x90
[25626.668259]  [<ffffffff8166869d>] ? restore_args+0x30/0x30
[25626.679472]  [<ffffffff8167101a>] invalid_op+0x1a/0x20
[25626.689963]  [<ffffffff812982aa>] ? cachefiles_printk_object+0x7a/0x90
[25626.703239]  [<ffffffff81299cf3>] ? cachefiles_walk_to_object+0xcb3/0xdd0
[25626.716973]  [<ffffffff81074010>] ? wake_up_bit+0x40/0x40
[25626.727868]  [<ffffffff812973ab>] cachefiles_lookup_object+0x5b/0x170
[25626.740810]  [<ffffffff811ad864>] fscache_lookup_object+0xd4/0x2b0
[25626.753283]  [<ffffffff811ae789>] fscache_object_work_func+0x4f9/0xd60
[25626.766459]  [<ffffffff8106c594>] process_one_work+0x164/0x450
[25626.778255]  [<ffffffff811ae290>] ? fscache_enqueue_dependents+0x120/0x120
[25626.792232]  [<ffffffff8106cc2b>] worker_thread+0x19b/0x430
[25626.803638]  [<ffffffff8106ca90>] ? manage_workers+0x210/0x210
[25626.815400]  [<ffffffff81073abe>] kthread+0x9e/0xb0
[25626.825307]  [<ffffffff81671194>] kernel_thread_helper+0x4/0x10
[25626.837233]  [<ffffffff8166866d>] ? retint_restore_args+0x13/0x13
[25626.849515]  [<ffffffff81073a20>] ? kthread_worker_fn+0x1a0/0x1a0
[25626.861838]  [<ffffffff81671190>] ? gs_change+0xb/0xb
[25626.881978] Rebooting in 120 seconds..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-08-25 16:44 3.0.3 64-bit Crash running fscache/cachefilesd Mark Moseley
@ 2011-08-26 12:52 ` Дмитрий Ильин
  2011-09-01 13:04 ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Дмитрий Ильин @ 2011-08-26 12:52 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: linux-kernel

Same problem here bro)

Heavy NFS based web hosting too.

I get mostly this on 38 kernel

[416779.040059] CacheFiles: Error: Overlong wait for old active object
to go away
[416779.040104] object: OBJ43981
[416779.040127] objstate=OBJECT_LOOKING_UP fl=0 wbusy=2 ev=0[7b]
[416779.040154] ops=0 inp=0 exc=0
[416779.040174] parent=ffff8800cca3c3c0
[416779.040197] cookie=ffff880100029870 [pr=ffff8801131520f0
nd=ffff88000662b180 fl=7]
[416779.040241] key=[36]
'010007010100a00800000000e6c715cb23cd4aa38338eea581f3d1045c2e680a1b56d03d'
[416779.040325] xobject: OBJ431fe
[416779.040348] xobjstate=OBJECT_RECYCLING fl=0 wbusy=2 ev=20[3]
[416779.040373] xops=0 inp=0 exc=0
[416779.040394] xparent=ffff8800cca3c3c0
[416779.040416] xcookie=NULL

And 3.0.1 kernel crashes too.

No idea how to debug…

fscache is NOT stable and looks like will never be.
NFS without cache is too slow.

CIFS is even worse.
GFS2/OCFS2 too complex and unstable.
PVFS? Coda? GlusterFS? Lustre?
They do not suite well for shared web hosting(

Looks like clustering is bad idea and we'll have to return to separate servers and caching servers.



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-08-25 16:44 3.0.3 64-bit Crash running fscache/cachefilesd Mark Moseley
  2011-08-26 12:52 ` [Linux-cachefs] " Дмитрий Ильин
@ 2011-09-01 13:04 ` David Howells
  2011-09-22 17:03   ` Mark Moseley
  1 sibling, 1 reply; 46+ messages in thread
From: David Howells @ 2011-09-01 13:04 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> [25625.932971] ------------[ cut here ]------------

There's some important information before the cut-here line.  If you could
paste that too, it would be most useful.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-01 13:04 ` David Howells
@ 2011-09-22 17:03   ` Mark Moseley
  2011-09-22 21:41     ` Mark Moseley
  2011-09-26 11:32     ` David Howells
  0 siblings, 2 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-22 17:03 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Sep 1, 2011 at 6:04 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> [25625.932971] ------------[ cut here ]------------
>
> There's some important information before the cut-here line.  If you could
> paste that too, it would be most useful.
>

Somehow I completely missed this reply, or I'd have replied later the
same day :)

I'll have to grab a new trace. I'll post that as soon as I get it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-22 17:03   ` Mark Moseley
@ 2011-09-22 21:41     ` Mark Moseley
  2011-09-26 11:32     ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-22 21:41 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Sep 22, 2011 at 10:03 AM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Thu, Sep 1, 2011 at 6:04 AM, David Howells <dhowells@redhat.com> wrote:
>> Mark Moseley <moseleymark@gmail.com> wrote:
>>
>>> [25625.932971] ------------[ cut here ]------------
>>
>> There's some important information before the cut-here line.  If you could
>> paste that too, it would be most useful.
>>
>
> Somehow I completely missed this reply, or I'd have replied later the
> same day :)
>
> I'll have to grab a new trace. I'll post that as soon as I get it.


I thought I'd be extra-helpful by getting that trace with a 3.0.4
kernel but got a completely different error this time (there was
nothing logged above this though). There was a
'__fscache_read_or_alloc_pages' crash for the previous boot too,
though it went for about 2.5 hours that time (with an empty cache
partition though).

[  606.240505] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000040
[  606.250342] IP: [<ffffffff811ab9aa>]
__fscache_read_or_alloc_pages+0x14a/0x3b0
[  606.250342] PGD 62b46067 PUD 62b4e067 PMD 0
[  606.250342] Oops: 0000 [#1] SMP
[  606.250342] CPU 4
[  606.250342] Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse i5000_edac edac_core dcdbas shpchp pci_hotplug i5k_amb
serio_raw pcspkr sg s]
[  606.250342]
[  606.250342] Pid: 5902, comm: httpd Not tainted 3.0.4 #1 Dell Inc.
PowerEdge 1950/0DT097
[  606.250342] RIP: 0010:[<ffffffff811ab9aa>]  [<ffffffff811ab9aa>]
__fscache_read_or_alloc_pages+0x14a/0x3b0
[  606.250342] RSP: 0018:ffff880062b519f8  EFLAGS: 00010246
[  606.250342] RAX: 0000000000000000 RBX: ffff8801188f6c30 RCX: 00000000c0000100
[  606.250342] RDX: 0000000000000000 RSI: ffff880062b51958 RDI: ffff88022ffd2b88
[  606.250342] RBP: ffff880062b51a38 R08: ffff8801ecb4e300 R09: 0000000000000008
[  606.250342] R10: 0000000000000007 R11: 0000000000000001 R12: 0000000000000000
[  606.250342] R13: ffff8800cb369380 R14: ffff880062b51bc8 R15: ffff880062b51aa4
[  606.250342] FS:  0000033d3885d6e0(0000) GS:ffff88022fd00000(0000)
knlGS:0000000000000000
[  606.250342] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  606.250342] CR2: 0000000000000040 CR3: 0000000001681000 CR4: 00000000000006f0
[  606.250342] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  606.250342] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  606.250342] Process httpd (pid: 5902, threadinfo ffff8801ecb4e300,
task ffff8801ecb4de00)
[  606.250342] Stack:
[  606.250342]  ffff88012172fa80 ffffffff81259c40 ffff880111153000
00000000ffffff97
[  606.250342]  ffff880062b51aa4 0000000000000001 ffff880061907050
ffff880061907198
[  606.250342]  ffff880062b51a88 ffffffff81259af0 ffffffff000200da
0000000000000002
[  606.250342] Call Trace:
[  606.250342]  [<ffffffff81259c40>] ? __nfs_readpages_from_fscache+0x1c0/0x1c0
[  606.250342]  [<ffffffff81259af0>] __nfs_readpages_from_fscache+0x70/0x1c0
[  606.250342]  [<ffffffff8123a6c6>] nfs_readpages+0xd6/0x200
[  606.250342]  [<ffffffff810f2e9a>] __do_page_cache_readahead+0x1da/0x270
[  606.250342]  [<ffffffff810f2f51>] ra_submit+0x21/0x30
[  606.250342]  [<ffffffff810f317d>] ondemand_readahead+0x11d/0x250
[  606.250342]  [<ffffffff810f33a6>] page_cache_sync_readahead+0x36/0x60
[  606.250342]  [<ffffffff810eb1ec>] generic_file_aio_read+0x45c/0x770
[  606.250342]  [<ffffffff8122df01>] nfs_file_read+0xe1/0x130
[  606.250342]  [<ffffffff8112d4c9>] do_sync_read+0xd9/0x120
[  606.250342]  [<ffffffff810ea629>] ? __filemap_fdatawrite_range+0x59/0x70
[  606.250342]  [<ffffffff8113206b>] ? cp_new_stat+0x16b/0x190
[  606.250342]  [<ffffffff8112e335>] vfs_read+0xf5/0x210
[  606.250342]  [<ffffffff8112e545>] sys_read+0x55/0x90
[  606.250342]  [<ffffffff81650ef4>] system_call_fastpath+0x16/0x1b
[  606.250342] Code: 48 8b 7a 28 ff d0 48 c7 c1 64 59 d5 81 48 c7 c2
6c 59 d5 81 48 8b 75 d0 4c 89 ef e8 c1 fa ff ff 41 89 c4 85 c0 78 51
49 8b 45 70
[  606.250342]  8b 40 40 a8 04 0f 84 4a 01 00 00 f0 ff 05 97 a0 ba 00 71 09
[  606.250342] RIP  [<ffffffff811ab9aa>]
__fscache_read_or_alloc_pages+0x14a/0x3b0
[  606.250342]  RSP <ffff880062b519f8>
[  606.250342] CR2: 0000000000000040
[  606.843210] ---[ end trace 48cd119076837772 ]---


I'll boot into that same 3.0.3 kernel as before and see if I get
anything logged before the '--cut--' line for it.

In general, the same pattern of things seeming to be generally working
fine when the cache dir is empty and things start to go bad when the
cache partition is fullish. I haven't measured this with any
exactitude though, so could be coincidental.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-22 17:03   ` Mark Moseley
  2011-09-22 21:41     ` Mark Moseley
@ 2011-09-26 11:32     ` David Howells
  2011-09-26 21:02       ` Mark Moseley
  1 sibling, 1 reply; 46+ messages in thread
From: David Howells @ 2011-09-26 11:32 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> I thought I'd be extra-helpful by getting that trace with a 3.0.4
> kernel but got a completely different error this time (there was
> nothing logged above this though). There was a
> '__fscache_read_or_alloc_pages' crash for the previous boot too,
> though it went for about 2.5 hours that time (with an empty cache
> partition though).

I'm fairly certain I know what the cause of this one is: Invalidation upon
server change isn't handled correctly.  NFS tries to invalidate a file by
discarding that file's attachment to the cache - without first clearing up the
operations it has outstanding on the cache for that file.

I'm working on adding formal invalidation at the moment.

The attached patch may get you more precise information.  The first hunk is the
main catcher.

David
---
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index 9905350..48c63b8 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -452,6 +452,13 @@ void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
 
 		_debug("RELEASE OBJ%x", object->debug_id);
 
+		if (atomic_read(&object->n_reads)) {
+			spin_unlock(&cookie->lock);
+			printk(KERN_ERR "FS-Cache: Cookie '%s' still has outstanding reads\n",
+			       cookie->def->name);
+			BUG();
+		}
+
 		/* detach each cache object from the object cookie */
 		spin_lock(&object->lock);
 		hlist_del_init(&object->cookie_link);
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index b8b62f4..f087051 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -496,6 +496,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
 	if (fscache_submit_op(object, &op->op) < 0)
 		goto nobufs_unlock;
 	spin_unlock(&cookie->lock);
+	ASSERTCMP(object->cookie, ==, cookie);
 
 	fscache_stat(&fscache_n_retrieval_ops);
 
@@ -513,6 +514,26 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
 		goto error;
 
 	/* ask the cache to honour the operation */
+	if (!object->cookie) {
+		const char prefix[] = "fs-";
+		printk(KERN_ERR "%sobject: OBJ%x\n",
+		       prefix, object->debug_id);
+		printk(KERN_ERR "%sobjstate=%s fl=%lx wbusy=%x ev=%lx[%lx]\n",
+		       prefix, fscache_object_states[object->state],
+		       object->flags, work_busy(&object->work),
+		       object->events,
+		       object->event_mask & FSCACHE_OBJECT_EVENTS_MASK);
+		printk(KERN_ERR "%sops=%u inp=%u exc=%u\n",
+		       prefix, object->n_ops, object->n_in_progress,
+		       object->n_exclusive);
+		printk(KERN_ERR "%sparent=%p\n",
+		       prefix, object->parent);
+		printk(KERN_ERR "%scookie=%p [pr=%p nd=%p fl=%lx]\n",
+		       prefix, object->cookie,
+		       cookie->parent, cookie->netfs_data, cookie->flags);
+	}
+	ASSERTCMP(object->cookie, ==, cookie);
+
 	if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
 		fscache_stat(&fscache_n_cop_allocate_pages);
 		ret = object->cache->ops->allocate_pages(

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-26 11:32     ` David Howells
@ 2011-09-26 21:02       ` Mark Moseley
  2011-09-27  0:59         ` Mark Moseley
  2011-09-29 14:57         ` David Howells
  0 siblings, 2 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-26 21:02 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Mon, Sep 26, 2011 at 4:32 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> I thought I'd be extra-helpful by getting that trace with a 3.0.4
>> kernel but got a completely different error this time (there was
>> nothing logged above this though). There was a
>> '__fscache_read_or_alloc_pages' crash for the previous boot too,
>> though it went for about 2.5 hours that time (with an empty cache
>> partition though).
>
> I'm fairly certain I know what the cause of this one is: Invalidation upon
> server change isn't handled correctly.  NFS tries to invalidate a file by
> discarding that file's attachment to the cache - without first clearing up the
> operations it has outstanding on the cache for that file.
>
> I'm working on adding formal invalidation at the moment.
>
> The attached patch may get you more precise information.  The first hunk is the
> main catcher.
>
> David
> ---
> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
> index 9905350..48c63b8 100644
> --- a/fs/fscache/cookie.c
> +++ b/fs/fscache/cookie.c
> @@ -452,6 +452,13 @@ void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
>
>                _debug("RELEASE OBJ%x", object->debug_id);
>
> +               if (atomic_read(&object->n_reads)) {
> +                       spin_unlock(&cookie->lock);
> +                       printk(KERN_ERR "FS-Cache: Cookie '%s' still has outstanding reads\n",
> +                              cookie->def->name);
> +                       BUG();
> +               }
> +
>                /* detach each cache object from the object cookie */
>                spin_lock(&object->lock);
>                hlist_del_init(&object->cookie_link);
> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
> index b8b62f4..f087051 100644
> --- a/fs/fscache/page.c
> +++ b/fs/fscache/page.c
> @@ -496,6 +496,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
>        if (fscache_submit_op(object, &op->op) < 0)
>                goto nobufs_unlock;
>        spin_unlock(&cookie->lock);
> +       ASSERTCMP(object->cookie, ==, cookie);
>
>        fscache_stat(&fscache_n_retrieval_ops);
>
> @@ -513,6 +514,26 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
>                goto error;
>
>        /* ask the cache to honour the operation */
> +       if (!object->cookie) {
> +               const char prefix[] = "fs-";
> +               printk(KERN_ERR "%sobject: OBJ%x\n",
> +                      prefix, object->debug_id);
> +               printk(KERN_ERR "%sobjstate=%s fl=%lx wbusy=%x ev=%lx[%lx]\n",
> +                      prefix, fscache_object_states[object->state],
> +                      object->flags, work_busy(&object->work),
> +                      object->events,
> +                      object->event_mask & FSCACHE_OBJECT_EVENTS_MASK);
> +               printk(KERN_ERR "%sops=%u inp=%u exc=%u\n",
> +                      prefix, object->n_ops, object->n_in_progress,
> +                      object->n_exclusive);
> +               printk(KERN_ERR "%sparent=%p\n",
> +                      prefix, object->parent);
> +               printk(KERN_ERR "%scookie=%p [pr=%p nd=%p fl=%lx]\n",
> +                      prefix, object->cookie,
> +                      cookie->parent, cookie->netfs_data, cookie->flags);
> +       }
> +       ASSERTCMP(object->cookie, ==, cookie);
> +
>        if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
>                fscache_stat(&fscache_n_cop_allocate_pages);
>                ret = object->cache->ops->allocate_pages(
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

Ok, patched and running now. This same box was running 3.0.3 over the
weekend but it died without a stacktrace (and I had set it up to not
start cachefilesd on boot for the next boot). After I get the trace
for 3.0.4, I'll boot back into 3.0.3 and see if I can get that
previous trace again.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-26 21:02       ` Mark Moseley
@ 2011-09-27  0:59         ` Mark Moseley
  2011-09-27 23:46           ` Mark Moseley
  2011-09-29 14:57         ` David Howells
  1 sibling, 1 reply; 46+ messages in thread
From: Mark Moseley @ 2011-09-27  0:59 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Mon, Sep 26, 2011 at 2:02 PM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Mon, Sep 26, 2011 at 4:32 AM, David Howells <dhowells@redhat.com> wrote:
>> Mark Moseley <moseleymark@gmail.com> wrote:
>>
>>> I thought I'd be extra-helpful by getting that trace with a 3.0.4
>>> kernel but got a completely different error this time (there was
>>> nothing logged above this though). There was a
>>> '__fscache_read_or_alloc_pages' crash for the previous boot too,
>>> though it went for about 2.5 hours that time (with an empty cache
>>> partition though).
>>
>> I'm fairly certain I know what the cause of this one is: Invalidation upon
>> server change isn't handled correctly.  NFS tries to invalidate a file by
>> discarding that file's attachment to the cache - without first clearing up the
>> operations it has outstanding on the cache for that file.
>>
>> I'm working on adding formal invalidation at the moment.
>>
>> The attached patch may get you more precise information.  The first hunk is the
>> main catcher.
>>
>> David
>> ---
>> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
>> index 9905350..48c63b8 100644
>> --- a/fs/fscache/cookie.c
>> +++ b/fs/fscache/cookie.c
>> @@ -452,6 +452,13 @@ void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
>>
>>                _debug("RELEASE OBJ%x", object->debug_id);
>>
>> +               if (atomic_read(&object->n_reads)) {
>> +                       spin_unlock(&cookie->lock);
>> +                       printk(KERN_ERR "FS-Cache: Cookie '%s' still has outstanding reads\n",
>> +                              cookie->def->name);
>> +                       BUG();
>> +               }
>> +
>>                /* detach each cache object from the object cookie */
>>                spin_lock(&object->lock);
>>                hlist_del_init(&object->cookie_link);
>> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
>> index b8b62f4..f087051 100644
>> --- a/fs/fscache/page.c
>> +++ b/fs/fscache/page.c
>> @@ -496,6 +496,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
>>        if (fscache_submit_op(object, &op->op) < 0)
>>                goto nobufs_unlock;
>>        spin_unlock(&cookie->lock);
>> +       ASSERTCMP(object->cookie, ==, cookie);
>>
>>        fscache_stat(&fscache_n_retrieval_ops);
>>
>> @@ -513,6 +514,26 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
>>                goto error;
>>
>>        /* ask the cache to honour the operation */
>> +       if (!object->cookie) {
>> +               const char prefix[] = "fs-";
>> +               printk(KERN_ERR "%sobject: OBJ%x\n",
>> +                      prefix, object->debug_id);
>> +               printk(KERN_ERR "%sobjstate=%s fl=%lx wbusy=%x ev=%lx[%lx]\n",
>> +                      prefix, fscache_object_states[object->state],
>> +                      object->flags, work_busy(&object->work),
>> +                      object->events,
>> +                      object->event_mask & FSCACHE_OBJECT_EVENTS_MASK);
>> +               printk(KERN_ERR "%sops=%u inp=%u exc=%u\n",
>> +                      prefix, object->n_ops, object->n_in_progress,
>> +                      object->n_exclusive);
>> +               printk(KERN_ERR "%sparent=%p\n",
>> +                      prefix, object->parent);
>> +               printk(KERN_ERR "%scookie=%p [pr=%p nd=%p fl=%lx]\n",
>> +                      prefix, object->cookie,
>> +                      cookie->parent, cookie->netfs_data, cookie->flags);
>> +       }
>> +       ASSERTCMP(object->cookie, ==, cookie);
>> +
>>        if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
>>                fscache_stat(&fscache_n_cop_allocate_pages);
>>                ret = object->cache->ops->allocate_pages(
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>
> Ok, patched and running now. This same box was running 3.0.3 over the
> weekend but it died without a stacktrace (and I had set it up to not
> start cachefilesd on boot for the next boot). After I get the trace
> for 3.0.4, I'll boot back into 3.0.3 and see if I can get that
> previous trace again.
>

Ok, this just popped 15 mins ago. Here's the trace. Some went to IPMI,
some made it to the logs, so I'll post both (logs first):


[12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads
[12999.564927] ------------[ cut here ]------------
[12999.574160] kernel BUG at fs/fscache/cookie.c:459!
[12999.574694] invalid opcode: 0000 [#1] SMP
[12999.574694] CPU 2
[12999.574694] Modules linked in: xfs ioatdma dca loop joydev evdev
dcdbas psmouse pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
pci_hotplug sg sr_mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last
unloaded: scsi_wait_scan]
[12999.574694]
[12999.574694] Pid: 8589, comm: httpd Not tainted 3.0.4 #1 Dell Inc.
PowerEdge 1950/0DT097
[12999.574694] RIP: 0010:[<ffffffff811a6370>]  [<ffffffff811a6370>]
__fscache_relinquish_cookie+0x1c0/0x3b0
[12999.574694] RSP: 0018:ffff8800c64b1cd8  EFLAGS: 00010296
[12999.574694] RAX: 000000000000004b RBX: ffff8801b1eea780 RCX: ffffffff81cfd080
[12999.574694] RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81cfcf70
[12999.574694] RBP: ffff8800c64b1d18 R08: 0000000000000006 R09: 0000000000000000
[12999.574694] R10: 0000000000000003 R11: 00000000ffffffff R12: ffff8800c962eb40
[12999.574694] R13: ffff8801b1eea7d8 R14: ffff880224d73800 R15: ffff8800c962eb48
[12999.574694] FS:  0000036e5df836e0(0000) GS:ffff88022fc80000(0000)
knlGS:0000000000000000
[12999.574694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12999.574694] CR2: ffffffffff600400 CR3: 000000000167f000 CR4: 00000000000006f0
[12999.574694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12999.574694] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400


>From console via IPMI:

[12999.564927] ------------[ cut here ]------------
[12999.574160] kernel BUG at fs/fscache/cookie.c:459!
[12999.574694] invalid opcode: 0000 [#1] SMP
[12999.574694] CPU 2
[12999.574694] Modules linked in: xfs ioatdma dca loop joydev evdev
dcdbas psmouse pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
pci_hotplug sg sr
_mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last unloaded: scsi_wait_scan]
[12999.574694]
[12999.574694] Pid: 8589, comm: httpd Not tainted 3.0.4 #1 Dell Inc.
PowerEdge 1950/0DT097
[12999.574694] RIP: 0010:[<ffffffff811a6370>]  [<ffffffff811a6370>]
__fscache_relinquish_cookie+0x1c0/0x3b0
[12999.574694] RSP: 0018:ffff8800c64b1cd8  EFLAGS: 00010296
[12999.574694] RAX: 000000000000004b RBX: ffff8801b1eea780 RCX: ffffffff81cfd080
[12999.574694] RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81cfcf70
[12999.574694] RBP: ffff8800c64b1d18 R08: 0000000000000006 R09: 0000000000000000
[12999.574694] R10: 0000000000000003 R11: 00000000ffffffff R12: ffff8800c962eb40
[12999.574694] R13: ffff8801b1eea7d8 R14: ffff880224d73800 R15: ffff8800c962eb48
[12999.574694] FS:  0000036e5df836e0(0000) GS:ffff88022fc80000(0000)
knlGS:0000000000000000
[12999.574694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12999.574694] CR2: ffffffffff600400 CR3: 000000000167f000 CR4: 00000000000006f0
[12999.574694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12999.574694] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[12999.574694] Process httpd (pid: 8589, threadinfo ffff880208ff8500,
task ffff880208ff8000)
[12999.574694] Stack:
[12999.574694]  ffffffff81073dd0 00000005c64b1ce0 ffff880208ff8428
ffff88019328ad98
[12999.574694]  ffff88019328aed0 ffff88019328ad08 ffff880224d73800
ffff880037a9d500
[12999.574694]  ffff8800c64b1d58 ffffffff8125a355 ffff88019328b018
ffff88019328aed0
[12999.574694] Call Trace:
[12999.574694]  [<ffffffff81073dd0>] ? autoremove_wake_function+0x40/0x40
[12999.574694]  [<ffffffff8125a355>] nfs_fscache_reset_inode_cookie+0x85/0x100
[12999.574694]  [<ffffffff81230316>] nfs_revalidate_mapping+0xb6/0x130
[12999.574694]  [<ffffffff8122e026>] nfs_file_read+0x86/0x130
[12999.574694]  [<ffffffff8112d4c9>] do_sync_read+0xd9/0x120
[12999.574694]  [<ffffffff810ea629>] ? __filemap_fdatawrite_range+0x59/0x70
[12999.574694]  [<ffffffff8113206b>] ? cp_new_stat+0x16b/0x190
[12999.574694]  [<ffffffff8112e335>] vfs_read+0xf5/0x210
[12999.574694]  [<ffffffff8112e545>] sys_read+0x55/0x90
[12999.574694]  [<ffffffff81651074>] system_call_fastpath+0x16/0x1b
[12999.574694] Code: 7c 6b 4a 00 48 83 c4 18 5b 41 5c 41 5d 41 5e 41
5f c9 c3 41 fe 44 24 08 48 c7 c7 88 27 8d 81 49 8b 74 24 18 31 c0 e8
55 6b 4a 00 <0f> 0b eb fe 41 8b 55 ac 48 8b 75 d0 48 c7 c7 93 a9 8c 81
31 c0
[12999.574694] RIP  [<ffffffff811a6370>] __fscache_relinquish_cookie+0x1c0/0x3b0
[12999.574694]  RSP <ffff8800c64b1cd8>
[12999.564927] [13000.116916] ---[ end trace 67856f08df77c469 ]---


I'll grab 3.0.3 tomorrow.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-27  0:59         ` Mark Moseley
@ 2011-09-27 23:46           ` Mark Moseley
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-27 23:46 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Mon, Sep 26, 2011 at 5:59 PM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Mon, Sep 26, 2011 at 2:02 PM, Mark Moseley <moseleymark@gmail.com> wrote:
>> On Mon, Sep 26, 2011 at 4:32 AM, David Howells <dhowells@redhat.com> wrote:
>>> Mark Moseley <moseleymark@gmail.com> wrote:
>>>
>>>> I thought I'd be extra-helpful by getting that trace with a 3.0.4
>>>> kernel but got a completely different error this time (there was
>>>> nothing logged above this though). There was a
>>>> '__fscache_read_or_alloc_pages' crash for the previous boot too,
>>>> though it went for about 2.5 hours that time (with an empty cache
>>>> partition though).
>>>
>>> I'm fairly certain I know what the cause of this one is: Invalidation upon
>>> server change isn't handled correctly.  NFS tries to invalidate a file by
>>> discarding that file's attachment to the cache - without first clearing up the
>>> operations it has outstanding on the cache for that file.
>>>
>>> I'm working on adding formal invalidation at the moment.
>>>
>>> The attached patch may get you more precise information.  The first hunk is the
>>> main catcher.
>>>
>>> David
>>> ---
>>> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
>>> index 9905350..48c63b8 100644
>>> --- a/fs/fscache/cookie.c
>>> +++ b/fs/fscache/cookie.c
>>> @@ -452,6 +452,13 @@ void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
>>>
>>>                _debug("RELEASE OBJ%x", object->debug_id);
>>>
>>> +               if (atomic_read(&object->n_reads)) {
>>> +                       spin_unlock(&cookie->lock);
>>> +                       printk(KERN_ERR "FS-Cache: Cookie '%s' still has outstanding reads\n",
>>> +                              cookie->def->name);
>>> +                       BUG();
>>> +               }
>>> +
>>>                /* detach each cache object from the object cookie */
>>>                spin_lock(&object->lock);
>>>                hlist_del_init(&object->cookie_link);
>>> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
>>> index b8b62f4..f087051 100644
>>> --- a/fs/fscache/page.c
>>> +++ b/fs/fscache/page.c
>>> @@ -496,6 +496,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
>>>        if (fscache_submit_op(object, &op->op) < 0)
>>>                goto nobufs_unlock;
>>>        spin_unlock(&cookie->lock);
>>> +       ASSERTCMP(object->cookie, ==, cookie);
>>>
>>>        fscache_stat(&fscache_n_retrieval_ops);
>>>
>>> @@ -513,6 +514,26 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
>>>                goto error;
>>>
>>>        /* ask the cache to honour the operation */
>>> +       if (!object->cookie) {
>>> +               const char prefix[] = "fs-";
>>> +               printk(KERN_ERR "%sobject: OBJ%x\n",
>>> +                      prefix, object->debug_id);
>>> +               printk(KERN_ERR "%sobjstate=%s fl=%lx wbusy=%x ev=%lx[%lx]\n",
>>> +                      prefix, fscache_object_states[object->state],
>>> +                      object->flags, work_busy(&object->work),
>>> +                      object->events,
>>> +                      object->event_mask & FSCACHE_OBJECT_EVENTS_MASK);
>>> +               printk(KERN_ERR "%sops=%u inp=%u exc=%u\n",
>>> +                      prefix, object->n_ops, object->n_in_progress,
>>> +                      object->n_exclusive);
>>> +               printk(KERN_ERR "%sparent=%p\n",
>>> +                      prefix, object->parent);
>>> +               printk(KERN_ERR "%scookie=%p [pr=%p nd=%p fl=%lx]\n",
>>> +                      prefix, object->cookie,
>>> +                      cookie->parent, cookie->netfs_data, cookie->flags);
>>> +       }
>>> +       ASSERTCMP(object->cookie, ==, cookie);
>>> +
>>>        if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
>>>                fscache_stat(&fscache_n_cop_allocate_pages);
>>>                ret = object->cache->ops->allocate_pages(
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>
>>
>> Ok, patched and running now. This same box was running 3.0.3 over the
>> weekend but it died without a stacktrace (and I had set it up to not
>> start cachefilesd on boot for the next boot). After I get the trace
>> for 3.0.4, I'll boot back into 3.0.3 and see if I can get that
>> previous trace again.
>>
>
> Ok, this just popped 15 mins ago. Here's the trace. Some went to IPMI,
> some made it to the logs, so I'll post both (logs first):
>
>
> [12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads
> [12999.564927] ------------[ cut here ]------------
> [12999.574160] kernel BUG at fs/fscache/cookie.c:459!
> [12999.574694] invalid opcode: 0000 [#1] SMP
> [12999.574694] CPU 2
> [12999.574694] Modules linked in: xfs ioatdma dca loop joydev evdev
> dcdbas psmouse pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
> pci_hotplug sg sr_mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last
> unloaded: scsi_wait_scan]
> [12999.574694]
> [12999.574694] Pid: 8589, comm: httpd Not tainted 3.0.4 #1 Dell Inc.
> PowerEdge 1950/0DT097
> [12999.574694] RIP: 0010:[<ffffffff811a6370>]  [<ffffffff811a6370>]
> __fscache_relinquish_cookie+0x1c0/0x3b0
> [12999.574694] RSP: 0018:ffff8800c64b1cd8  EFLAGS: 00010296
> [12999.574694] RAX: 000000000000004b RBX: ffff8801b1eea780 RCX: ffffffff81cfd080
> [12999.574694] RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81cfcf70
> [12999.574694] RBP: ffff8800c64b1d18 R08: 0000000000000006 R09: 0000000000000000
> [12999.574694] R10: 0000000000000003 R11: 00000000ffffffff R12: ffff8800c962eb40
> [12999.574694] R13: ffff8801b1eea7d8 R14: ffff880224d73800 R15: ffff8800c962eb48
> [12999.574694] FS:  0000036e5df836e0(0000) GS:ffff88022fc80000(0000)
> knlGS:0000000000000000
> [12999.574694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12999.574694] CR2: ffffffffff600400 CR3: 000000000167f000 CR4: 00000000000006f0
> [12999.574694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [12999.574694] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>
>
> From console via IPMI:
>
> [12999.564927] ------------[ cut here ]------------
> [12999.574160] kernel BUG at fs/fscache/cookie.c:459!
> [12999.574694] invalid opcode: 0000 [#1] SMP
> [12999.574694] CPU 2
> [12999.574694] Modules linked in: xfs ioatdma dca loop joydev evdev
> dcdbas psmouse pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
> pci_hotplug sg sr
> _mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last unloaded: scsi_wait_scan]
> [12999.574694]
> [12999.574694] Pid: 8589, comm: httpd Not tainted 3.0.4 #1 Dell Inc.
> PowerEdge 1950/0DT097
> [12999.574694] RIP: 0010:[<ffffffff811a6370>]  [<ffffffff811a6370>]
> __fscache_relinquish_cookie+0x1c0/0x3b0
> [12999.574694] RSP: 0018:ffff8800c64b1cd8  EFLAGS: 00010296
> [12999.574694] RAX: 000000000000004b RBX: ffff8801b1eea780 RCX: ffffffff81cfd080
> [12999.574694] RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81cfcf70
> [12999.574694] RBP: ffff8800c64b1d18 R08: 0000000000000006 R09: 0000000000000000
> [12999.574694] R10: 0000000000000003 R11: 00000000ffffffff R12: ffff8800c962eb40
> [12999.574694] R13: ffff8801b1eea7d8 R14: ffff880224d73800 R15: ffff8800c962eb48
> [12999.574694] FS:  0000036e5df836e0(0000) GS:ffff88022fc80000(0000)
> knlGS:0000000000000000
> [12999.574694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [12999.574694] CR2: ffffffffff600400 CR3: 000000000167f000 CR4: 00000000000006f0
> [12999.574694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [12999.574694] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [12999.574694] Process httpd (pid: 8589, threadinfo ffff880208ff8500,
> task ffff880208ff8000)
> [12999.574694] Stack:
> [12999.574694]  ffffffff81073dd0 00000005c64b1ce0 ffff880208ff8428
> ffff88019328ad98
> [12999.574694]  ffff88019328aed0 ffff88019328ad08 ffff880224d73800
> ffff880037a9d500
> [12999.574694]  ffff8800c64b1d58 ffffffff8125a355 ffff88019328b018
> ffff88019328aed0
> [12999.574694] Call Trace:
> [12999.574694]  [<ffffffff81073dd0>] ? autoremove_wake_function+0x40/0x40
> [12999.574694]  [<ffffffff8125a355>] nfs_fscache_reset_inode_cookie+0x85/0x100
> [12999.574694]  [<ffffffff81230316>] nfs_revalidate_mapping+0xb6/0x130
> [12999.574694]  [<ffffffff8122e026>] nfs_file_read+0x86/0x130
> [12999.574694]  [<ffffffff8112d4c9>] do_sync_read+0xd9/0x120
> [12999.574694]  [<ffffffff810ea629>] ? __filemap_fdatawrite_range+0x59/0x70
> [12999.574694]  [<ffffffff8113206b>] ? cp_new_stat+0x16b/0x190
> [12999.574694]  [<ffffffff8112e335>] vfs_read+0xf5/0x210
> [12999.574694]  [<ffffffff8112e545>] sys_read+0x55/0x90
> [12999.574694]  [<ffffffff81651074>] system_call_fastpath+0x16/0x1b
> [12999.574694] Code: 7c 6b 4a 00 48 83 c4 18 5b 41 5c 41 5d 41 5e 41
> 5f c9 c3 41 fe 44 24 08 48 c7 c7 88 27 8d 81 49 8b 74 24 18 31 c0 e8
> 55 6b 4a 00 <0f> 0b eb fe 41 8b 55 ac 48 8b 75 d0 48 c7 c7 93 a9 8c 81
> 31 c0
> [12999.574694] RIP  [<ffffffff811a6370>] __fscache_relinquish_cookie+0x1c0/0x3b0
> [12999.574694]  RSP <ffff8800c64b1cd8>
> [12999.564927] [13000.116916] ---[ end trace 67856f08df77c469 ]---
>
>
> I'll grab 3.0.3 tomorrow.
>

Hmm, my 3.0.3 popped but I didn't get the " fs/cachefiles/namei.c"
error from the beginning of the thread. Looks like the same location
as for 3.0.4. Figured I'd post it anyway, in case it's helpful. I'll
clear the fscache cache and let it run on 3.0.3 again.


[ 5460.793832] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000040
[ 5460.803806] IP: [<ffffffff8119e302>]
__fscache_read_or_alloc_pages+0x142/0x370
[ 5460.803806] PGD 104d81067 PUD 22400b067 PMD 0
[ 5460.803806] Oops: 0000 [#1] SMP
[ 5460.803806] CPU 7
[ 5460.803806] Modules linked in: xfs ioatdma dca loop joydev fan
evdev psmouse i5000_edac edac_core dcdbas i5k_amb serio_raw pcspkr
shpchp pci_hotplug ]
[ 5460.803806]
[ 5460.803806] Pid: 18863, comm: httpd Not tainted 3.0.3 #1 Dell Inc.
PowerEdge 1950/0DT097
[ 5460.803806] RIP: 0010:[<ffffffff8119e302>]  [<ffffffff8119e302>]
__fscache_read_or_alloc_pages+0x142/0x370
[ 5460.803806] RSP: 0018:ffff88008fb9fa28  EFLAGS: 00010246
[ 5460.803806] RAX: 0000000000000000 RBX: ffff8800926ee230 RCX: ffff88008fb9e000
[ 5460.803806] RDX: 0000000000000000 RSI: ffff88008fb9f988 RDI: ffff88022ffd2ba0
[ 5460.803806] RBP: ffff88008fb9fa68 R08: ffff88008fb9e000 R09: 0000000000000000
[ 5460.803806] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 5460.803806] R13: ffff8800c9a72300 R14: ffff88008fb9fbf8 R15: ffff88008fb9fad4
[ 5460.803806] FS:  00007f0ef221c6e0(0000) GS:ffff88022fdc0000(0000)
knlGS:0000000000000000
[ 5460.803806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5460.803806] CR2: 0000000000000040 CR3: 0000000208240000 CR4: 00000000000006e0
[ 5460.803806] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5460.803806] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5460.803806] Process httpd (pid: 18863, threadinfo ffff88008fb9e000,
task ffff88004c4c5bc0)
[ 5460.803806] Stack:
[ 5460.803806]  ffff8800c6e22c00 ffffffff8124af80 ffff8801015bd600
00000000ffffff97
[ 5460.803806]  ffff88008fb9fad4 0000000000000001 ffff8800acd2ec38
ffff8800acd2ed80
[ 5460.803806]  ffff88008fb9fab8 ffffffff8124ae30 ffffffff000200da
ffff880200000002
[ 5460.803806] Call Trace:
[ 5460.803806]  [<ffffffff8124af80>] ? __nfs_readpages_from_fscache+0x1c0/0x1c0
[ 5460.803806]  [<ffffffff8124ae30>] __nfs_readpages_from_fscache+0x70/0x1c0
[ 5460.803806]  [<ffffffff8122bdf6>] nfs_readpages+0xd6/0x200
[ 5460.803806]  [<ffffffff815e91b6>] ? rpc_do_put_task+0x36/0x50
[ 5460.803806]  [<ffffffff810ec59a>] __do_page_cache_readahead+0x1da/0x280
[ 5460.803806]  [<ffffffff810ec661>] ra_submit+0x21/0x30
[ 5460.803806]  [<ffffffff810ec88d>] ondemand_readahead+0x11d/0x250
[ 5460.803806]  [<ffffffff810ecab6>] page_cache_sync_readahead+0x36/0x60
[ 5460.803806]  [<ffffffff810e485c>] generic_file_aio_read+0x45c/0x770
[ 5460.803806]  [<ffffffff8121f8c1>] nfs_file_read+0xe1/0x130
[ 5460.803806]  [<ffffffff8103bcef>] ? set_next_entity+0xaf/0xd0
[ 5460.803806]  [<ffffffff81127ec9>] do_sync_read+0xd9/0x120
[ 5460.803806]  [<ffffffff81128c28>] vfs_read+0xc8/0x180
[ 5461.296269]  [<ffffffff81128de5>] sys_read+0x55/0x90
[ 5461.296269]  [<ffffffff8162f06b>] system_call_fastpath+0x16/0x1b
[ 5461.296269] Code: 48 8b 7a 28 ff d0 48 c7 c1 64 55 e5 81 48 c7 c2
6c 55 e5 81 48 8b 75 d0 4c 89 ef e8 d9 fa ff ff 41 89 c4 85 c0 78 3b
49 8b 45 70
[ 5461.296269]  8b 40 40 a8 04 0f 84 32 01 00 00 f0 ff 05 3f 73 cb 00 49 8b
[ 5461.296269] RIP  [<ffffffff8119e302>]
__fscache_read_or_alloc_pages+0x142/0x370
[ 5461.296269]  RSP <ffff88008fb9fa28>
[ 5461.296269] CR2: 0000000000000040
[ 5461.284631] BUG: unable to handle kernel
[ 5461.403124] ---[ end trace 755467f1fed5f5eb ]---

[ 5461.403130] Kernel panic - not syncing: Fatal exception
[ 5461.403136] Pid: 18863, comm: httpd Tainted: G      D     3.0.3 #1
[ 5461.403138] Call Trace:
[ 5461.403150]  [<ffffffff8162402e>] panic+0xbf/0x1e5
[ 5461.403157]  [<ffffffff8104cb3f>] ? kmsg_dump+0x4f/0xa0f>]
page_fault+0x1f/0x30
[ 5461.403203]  [<ffffffff8119e302>] ? __fscache_read_or_alloc_pages+0x142/0x370
[ 5461.403207]  [<ffffffff8119e2f7>] ? __fscache_read_or_alloc_pages+0x137/0x370
[ 5461.403212]  [<ffffffff8124af80>] ? __nfs_readpages_from_fscache+0x1c0/0x1c0
[ 5461.403216]  [<ffffffff8124ae30>] __nfs_readpages_from_fscache+0x70/0x1c0
[ 5461.403221]  [<ffffffff8122bdf6>] nfs_readpages+0xd6/0x200
[ 5461.403227]  [<ffffffff815e91b6>] ? rpc_do_put_task+0x36/0x50
[ 5461.403233]  [<ffffffff810ec59a>] __do_page_cache_readahead+0x1da/0x280
[ 5461.403238]  [<ffffffff810ec661>] ra_submit+0x21/0x30
[ 5461.403242]  [<ffffffff810ec88d>] ondemand_readahead+0x11d/0x250
[ 5461.403246]  [<ffffffff810ecab6>] page_cache_sync_readahead+0x36/0x60
[ 5461.403250]  [<ffffffff810e485c>] generic_file_aio_read+0x45c/0x770
[ 5461.403256]  [<ffffffff8121f8c1>] nfs_file_read+0xe1/0x130
[ 5461.403262]  [<ffffffff8103bcef>] ? set_next_entity+0xaf/0xd0
[ 5461.403268]  [<ffffffff81127ec9>] do_sync_read+0xd9/0x120
[ 5461.403272]  [<ffffffff81128c28>] vfs_read+0xc8/0x180
[ 5461.403275]  [<ffffffff81128de5>] sys_read+0x55/0x90
[ 5461.403279]  [<ffffffff8162f06b>] system_call_fastpath+0x16/0x1b
[ 5461.399263] NULL pointer dereference at 0000000000000040
[ 5461.399263] IP: [<ffffffff8119e302>]
__fscache_read_or_alloc_pages+0x142/0x370
[ 5461.399263] PGD 8fbb4067 PUD 1982e067 PMD 0
[ 5461.399263] Oops: 0000 [#2] SMP
[ 5461.399263] CPU 1
[ 5461.399263] Modules linked in: xfs ioatdma dca loop joydev fan
evdev psmouse i5000_edac edac_core dcdbas i5k_amb serio_raw pcspkr
shpchp pci_hotplug ]
[ 5461.399263]
[ 5461.399263] Pid: 18970, comm: httpd Tainted: G      D     3.0.3 #1
Dell Inc. PowerEdge 1950/0DT097
[ 5461.399263] RIP: 0010:[<ffffffff8119e302>]  [<ffffffff8119e302>]
__fscache_read_or_alloc_pages+0x142/0x370
[ 5461.399263] RSP: 0018:ffff88000eec5a28  EFLAGS: 00010246
[ 5461.399263] RAX: 0000000000000000 RBX: ffff8800926ee230 RCX: ffff88000eec4000
[ 5461.399263] RDX: 0000000000000000 RSI: ffff88000eec5988 RDI: ffff88022ffd2b70
[ 5461.399263] RBP: ffff88000eec5a68 R08: ffff88000eec4000 R09: 0000000000000000
[ 5461.399263] R10: 0000000000000400 R11: 0000000000000001 R12: 0000000000000000
[ 5461.399263] R13: ffff8800c9a72300 R14: ffff88000eec5bf8 R15: ffff88000eec5ad4
[ 5461.399263] FS:  00007f0ef221c6e0(0000) GS:ffff88022fc40000(0000)
knlGS:0000000000000000
[ 5461.399263] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5461.399263] CR2: 0000000000000040 CR3: 000000007d37d000 CR4: 00000000000006e0
[ 5461.399263] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5461.399263] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5461.399263] Process httpd (pid: 18970, threadinfo ffff88000eec4000,
task ffff88006fed44d0)
[ 5461.399263] Stack:
[ 5461.399263]  ffff880014d5cd00 ffffffff812

Cuts off at this point.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-26 21:02       ` Mark Moseley
  2011-09-27  0:59         ` Mark Moseley
@ 2011-09-29 14:57         ` David Howells
  2011-09-29 15:51           ` Mark Moseley
  2011-09-29 16:30           ` David Howells
  1 sibling, 2 replies; 46+ messages in thread
From: David Howells @ 2011-09-29 14:57 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> Ok, this just popped 15 mins ago. Here's the trace. Some went to IPMI,
> some made it to the logs, so I'll post both (logs first):
> 
> [12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads

Hopefully the patch set I sent to you will deal with this.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 14:57         ` David Howells
@ 2011-09-29 15:51           ` Mark Moseley
  2011-09-29 16:30           ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-29 15:51 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Sep 29, 2011 at 7:57 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> Ok, this just popped 15 mins ago. Here's the trace. Some went to IPMI,
>> some made it to the logs, so I'll post both (logs first):
>>
>> [12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads
>
> Hopefully the patch set I sent to you will deal with this.
>
> David

Not sure what you mean. Was there another patch besides the one to get
more info in the traceback?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 14:57         ` David Howells
  2011-09-29 15:51           ` Mark Moseley
@ 2011-09-29 16:30           ` David Howells
  2011-09-29 19:02             ` Mark Moseley
  1 sibling, 1 reply; 46+ messages in thread
From: David Howells @ 2011-09-29 16:30 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> >> [12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads
> >
> > Hopefully the patch set I sent to you will deal with this.
> >
> > David
> 
> Not sure what you mean. Was there another patch besides the one to get
> more info in the traceback?

I emailed a set of 13 patches to you a short time before that.  Hopefully
they'll fix the outstanding-reads problem the bad-page-state problem.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 16:30           ` David Howells
@ 2011-09-29 19:02             ` Mark Moseley
  2011-09-29 22:11               ` Mark Moseley
  2011-09-29 22:44               ` David Howells
  0 siblings, 2 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-29 19:02 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Sep 29, 2011 at 9:30 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> >> [12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads
>> >
>> > Hopefully the patch set I sent to you will deal with this.
>> >
>> > David
>>
>> Not sure what you mean. Was there another patch besides the one to get
>> more info in the traceback?
>
> I emailed a set of 13 patches to you a short time before that.  Hopefully
> they'll fix the outstanding-reads problem the bad-page-state problem.

Sorry, I just had this thread sitting open in gmail and I was looking
for the patch within it. I see all those emails you sent. I'll try to
get those patched in today. Thanks!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 19:02             ` Mark Moseley
@ 2011-09-29 22:11               ` Mark Moseley
  2011-09-29 22:44                 ` Mark Moseley
  2011-09-29 22:44               ` David Howells
  1 sibling, 1 reply; 46+ messages in thread
From: Mark Moseley @ 2011-09-29 22:11 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Sep 29, 2011 at 12:02 PM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Thu, Sep 29, 2011 at 9:30 AM, David Howells <dhowells@redhat.com> wrote:
>> Mark Moseley <moseleymark@gmail.com> wrote:
>>
>>> >> [12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads
>>> >
>>> > Hopefully the patch set I sent to you will deal with this.
>>> >
>>> > David
>>>
>>> Not sure what you mean. Was there another patch besides the one to get
>>> more info in the traceback?
>>
>> I emailed a set of 13 patches to you a short time before that.  Hopefully
>> they'll fix the outstanding-reads problem the bad-page-state problem.
>
> Sorry, I just had this thread sitting open in gmail and I was looking
> for the patch within it. I see all those emails you sent. I'll try to
> get those patched in today. Thanks!
>

With those 13 patches, I get this traceback. Nothing before or after
in logs or console.


[ 1499.098817] ------------[ cut here ]------------
[ 1499.108048] kernel BUG at fs/fscache/operation.c:408!
[ 1499.108675] invalid opcode: 0000 [#1] SMP
[ 1499.108675] CPU 2
[ 1499.108675] Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse dcdbas pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
pci_hotplug sg sr_mod]
[ 1499.108675]
[ 1499.108675] Pid: 8062, comm: httpd Not tainted 3.1.0-rc8 #1 Dell
Inc. PowerEdge 1950/0DT097
[ 1499.108675] RIP: 0010:[<ffffffff81197b24>]  [<ffffffff81197b24>]
fscache_put_operation+0x304/0x330
[ 1499.108675] RSP: 0018:ffff880062f739d8  EFLAGS: 00010296
[ 1499.108675] RAX: 0000000000000025 RBX: ffff8800c5122e84 RCX: ffffffff81ddf040
[ 1499.108675] RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
[ 1499.108675] RBP: ffff880062f739f8 R08: 0000000000000005 R09: 0000000000000000
[ 1499.108675] R10: 0000000000000000 R11: 0000000000000003 R12: ffff8800c5122e40
[ 1499.108675] R13: ffff880037a2cd20 R14: ffff880087c7a058 R15: ffff880087c7a000
[ 1499.108675] FS:  00007f63dcf636e0(0000) GS:ffff88022fc80000(0000)
knlGS:0000000000000000
[ 1499.108675] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1499.108675] CR2: 00007f0c0a91f000 CR3: 0000000062ec2000 CR4: 00000000000006e0
[ 1499.108675] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1499.108675] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1499.108675] Process httpd (pid: 8062, threadinfo ffff880062f72000,
task ffff880087e58000)
[ 1499.108675] Stack:
[ 1499.108675]  ffff880062f73bf8 0000000000000000 ffff880062f73bf8
ffff880037a2cd20
[ 1499.108675]  ffff880062f73a68 ffffffff8119aa7e ffff88006540e000
ffff880062f73ad4
[ 1499.108675]  ffff88008e9a4308 ffff880037a2cd20 ffff880062f73a48
ffff8800c5122e40
[ 1499.108675] Call Trace:
[ 1499.108675]  [<ffffffff8119aa7e>] __fscache_read_or_alloc_pages+0x1fe/0x530
[ 1499.108675]  [<ffffffff81250780>] __nfs_readpages_from_fscache+0x70/0x1c0
[ 1499.108675]  [<ffffffff8123142a>] nfs_readpages+0xca/0x1e0
[ 1499.108675]  [<ffffffff815f3c06>] ? rpc_do_put_task+0x36/0x50
[ 1499.108675]  [<ffffffff8122755b>] ? alloc_nfs_open_context+0x4b/0x110
[ 1499.108675]  [<ffffffff815ecd1a>] ? rpc_call_sync+0x5a/0x70
[ 1499.108675]  [<ffffffff810e7e9a>] __do_page_cache_readahead+0x1ca/0x270
[ 1499.108675]  [<ffffffff810e7f61>] ra_submit+0x21/0x30
[ 1499.108675]  [<ffffffff810e818d>] ondemand_readahead+0x11d/0x250
[ 1499.108675]  [<ffffffff810e83b6>] page_cache_sync_readahead+0x36/0x60
[ 1499.108675]  [<ffffffff810dffa4>] generic_file_aio_read+0x454/0x770
[ 1499.108675]  [<ffffffff81224ce1>] nfs_file_read+0xe1/0x130
[ 1499.108675]  [<ffffffff81121bd9>] do_sync_read+0xd9/0x120
[ 1499.108675]  [<ffffffff8114088f>] ? mntput+0x1f/0x40
[ 1499.108675]  [<ffffffff811238cb>] ? fput+0x1cb/0x260
[ 1499.600056]  [<ffffffff81122938>] vfs_read+0xc8/0x180
[ 1499.600056]  [<ffffffff81122af5>] sys_read+0x55/0x90

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 19:02             ` Mark Moseley
  2011-09-29 22:11               ` Mark Moseley
@ 2011-09-29 22:44               ` David Howells
  2011-09-29 22:51                 ` Mark Moseley
  2011-09-30 12:28                 ` David Howells
  1 sibling, 2 replies; 46+ messages in thread
From: David Howells @ 2011-09-29 22:44 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> With those 13 patches, I get this traceback. Nothing before or after
> in logs or console.
> 
> 
> [ 1499.098817] ------------[ cut here ]------------
> [ 1499.108048] kernel BUG at fs/fscache/operation.c:408!

Can you check what's at line 408 please?  If it's what I have in my code,
there should be an 'assertion failed' message with some numbers before it.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 22:11               ` Mark Moseley
@ 2011-09-29 22:44                 ` Mark Moseley
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-29 22:44 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Sep 29, 2011 at 3:11 PM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Thu, Sep 29, 2011 at 12:02 PM, Mark Moseley <moseleymark@gmail.com> wrote:
>> On Thu, Sep 29, 2011 at 9:30 AM, David Howells <dhowells@redhat.com> wrote:
>>> Mark Moseley <moseleymark@gmail.com> wrote:
>>>
>>>> >> [12999.564897] FS-Cache: Cookie 'NFS.fh' still has outstanding reads
>>>> >
>>>> > Hopefully the patch set I sent to you will deal with this.
>>>> >
>>>> > David
>>>>
>>>> Not sure what you mean. Was there another patch besides the one to get
>>>> more info in the traceback?
>>>
>>> I emailed a set of 13 patches to you a short time before that.  Hopefully
>>> they'll fix the outstanding-reads problem the bad-page-state problem.
>>
>> Sorry, I just had this thread sitting open in gmail and I was looking
>> for the patch within it. I see all those emails you sent. I'll try to
>> get those patched in today. Thanks!
>>
>
> With those 13 patches, I get this traceback. Nothing before or after
> in logs or console.
>

Got a bit more at the end this time (nothing before though):


[ 2709.769251] ------------[ cut here ]------------
[ 2709.778483] kernel BUG at fs/fscache/operation.c:408!
[ 2709.779147] invalid opcode: 0000 [#1] SMP
[ 2709.779147] CPU 0
[ 2709.779147] Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse serio_raw pcspkr dcdbas i5000_edac edac_core i5k_amb shpchp
pci_hotplug sg sr_mod]
[ 2709.779147]
[ 2709.779147] Pid: 10663, comm: httpd Not tainted 3.1.0-rc8 #1 Dell
Inc. PowerEdge 1950/0DT097
[ 2709.779147] RIP: 0010:[<ffffffff81197b24>]  [<ffffffff81197b24>]
fscache_put_operation+0x304/0x330
[ 2709.779147] RSP: 0018:ffff880126bf79d8  EFLAGS: 00010296
[ 2709.779147] RAX: 0000000000000025 RBX: ffff880111c1c644 RCX: ffffffff81ddf040
[ 2709.779147] RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
[ 2709.779147] RBP: ffff880126bf79f8 R08: 0000000000000006 R09: 0000000000000000
[ 2709.779147] R10: 0000000000000000 R11: 0000000000000003 R12: ffff880111c1c600
[ 2709.779147] R13: ffff880225e21af0 R14: ffff8800a7f14dd8 R15: ffff8800a7f14d80
[ 2709.779147] FS:  00007fd9b8f626e0(0000) GS:ffff88022fc00000(0000)
knlGS:0000000000000000
[ 2709.779147] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2709.779147] CR2: ffffffffff600400 CR3: 0000000127cf6000 CR4: 00000000000006f0
[ 2709.779147] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2709.779147] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2709.779147] Process httpd (pid: 10663, threadinfo ffff880126bf6000,
task ffff880009d78000)
[ 2709.779147] Stack:
[ 2709.779147]  ffff880126bf7bf8 0000000000000000 ffff880126bf7bf8
ffff880225e21af0
[ 2709.779147]  ffff880126bf7a68 ffffffff8119aa7e ffff880126bf7a18
ffff880126bf7ad4
[ 2709.779147]  ffff88019af67708 ffff880225e21af0 ffff880223cdfc3c
ffff880111c1c600
[ 2709.779147] Call Trace:
[ 2709.779147]  [<ffffffff8119aa7e>] __fscache_read_or_alloc_pages+0x1fe/0x530
[ 2709.779147]  [<ffffffff81250780>] __nfs_readpages_from_fscache+0x70/0x1c0
[ 2709.779147]  [<ffffffff8123142a>] nfs_readpages+0xca/0x1e0
[ 2709.779147]  [<ffffffff815f3c06>] ? rpc_do_put_task+0x36/0x50
[ 2709.779147]  [<ffffffff815ecd1a>] ? rpc_call_sync+0x5a/0x70
[ 2709.779147]  [<ffffffff810e7e9a>] __do_page_cache_readahead+0x1ca/0x270
[ 2709.779147]  [<ffffffff810e7f61>] ra_submit+0x21/0x30
[ 2709.779147]  [<ffffffff810e818d>] ondemand_readahead+0x11d/0x250
[ 2709.779147]  [<ffffffff810e83b6>] page_cache_sync_readahead+0x36/0x60
[ 2709.779147]  [<ffffffff810dffa4>] generic_file_aio_read+0x454/0x770
[ 2709.779147]  [<ffffffff81224ce1>] nfs_file_read+0xe1/0x130
[ 2709.779147]  [<ffffffff81121bd9>] do_sync_read+0xd9/0x120
[ 2709.779147]  [<ffffffff81122938>] vfs_read+0xc8/0x180
[ 2709.779147]  [<ffffffff81122af5>] sys_read+0x55/0x90
[ 2709.779147]  [<ffffffff8162b32b>] system_call_fastpath+0x16/0x1b
[ 2709.779147] Code: e8 d9 f9 48 00 48 c7 c7 88 1d a1 81 31 c0 e8 cb
f9 48 00 41 8b 74 24 40 ba 05 00 00 00 48 c7 c7 0f 9c a0 81 31 c0 e8
b3 f9 48 00 <0f> 0b
[ 2709.779147] RIP  [<ffffffff81197b24>] fscache_put_operation+0x304/0x330
[ 2709.779147]  RSP <ffff880126bf79d8>
[ 2710.341319] ---[ end trace df840d758e018f42 ]---

2011 Sep 29 18:3[ 2710.353281] Kernel panic - not syncing: Fatal exception
9:25 boscust2102[ 2710.364937] Pid: 10663, comm: httpd Tainted: G
D     3.1.0-rc8 #1
 [ 2709.769251] [ 2710.380560] Call Trace:
------------[ cu[ 2710.388046]  [<ffffffff816273ab>] panic+0xbf/0x1eb
t here ]--------[ 2710.400573]  [<ffffffff8104b98f>] ? kmsg_dump+0x4f/0x100
----
    2011 Sep 29[ 2710.413979]  [<ffffffff81005c08>] oops_end+0xa8/0xf0
 18:39:25 boscus[ 2710.426626]  [<ffffz��[ 2710.529057]
[<ffffffff81197b24>] ? fscache_put_operation+0x304/0x330
[ 2710.542164]  [<ffffffff81197b24>] ? fscache_put_operation+0x304/0x330
[ 2710.555411]  [<ffffffff8119aa7e>] __fscache_read_or_alloc_pages+0x1fe/0x530
[ 2710.569638]  [<ffffffff81250780>] __nfs_readpages_from_fscache+0x70/0x1c0
[ 2710.583413]  [<ffffffff8123142a>] nfs_readpages+0xca/0x1e0
[ 2710.594538]  [<ffffffff815f3c06>] ? rpc_do_put_task+0x36/0x50
[ 2710.606109]  [<ffffffff815ecd1a>] ? rpc_call_sync+0x5a/0x70
[ 2710.617354]  [<ffffffff810e7e9a>] __do_page_cache_readahead+0x1ca/0x270
[ 2710.630667]  [<ffffffff810e7f61>] ra_submit+0x21/0x30
[ 2710.640864]  [<ffffffff810e818d>] ondemand_readahead+0x11d/0x250
[ 2710.653060]  [<ffffffff810e83b6>] page_cache_sync_readahead+0x36/0x60
[ 2710.666126]  [<ffffffff810dffa4>] generic_file_aio_read+0x454/0x770
[ 2710.678961]  [<ffffffff81224ce1>] nfs_file_read+0xe1/0x130
[ 2710.690326]  [<ffffffff81121bd9>] do_sync_read+0xd9/0x120
[ 2710.701253]  [<ffffffff81122938>] vfs_read+0xc8/0x180
[ 2710.711554]  [<ffffffff81122af5>] sys_read+0x55/0x90
[ 2710.721618]  [<ffffffff8162b32b>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 22:44               ` David Howells
@ 2011-09-29 22:51                 ` Mark Moseley
  2011-09-30 12:28                 ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-29 22:51 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: linux-kernel

On Thu, Sep 29, 2011 at 3:44 PM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> With those 13 patches, I get this traceback. Nothing before or after
>> in logs or console.
>>
>>
>> [ 1499.098817] ------------[ cut here ]------------
>> [ 1499.108048] kernel BUG at fs/fscache/operation.c:408!
>
> Can you check what's at line 408 please?  If it's what I have in my code,
> there should be an 'assertion failed' message with some numbers before it.

Line 408 is the second half of this:

        ASSERTIFCMP(op->state != FSCACHE_OP_ST_COMPLETE,
                    op->state, ==, FSCACHE_OP_ST_CANCELLED);

Ack, you're right. I scanned right by it the first time, looking just
for traces.  Here's the bit from the logs:

[ 2709.769228] FS-Cache: Assertion failed
[ 2709.769231] 3 == 5 is false

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-29 22:44               ` David Howells
  2011-09-29 22:51                 ` Mark Moseley
@ 2011-09-30 12:28                 ` David Howells
  2011-09-30 18:57                   ` Mark Moseley
                                     ` (4 more replies)
  1 sibling, 5 replies; 46+ messages in thread
From: David Howells @ 2011-09-30 12:28 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, mark, linux-kernel


You'll probably need to add the attached patch also.  Turns out there were
some bits I'd missed.

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] CacheFiles: Add missing retrieval completions

CacheFiles is missing some calls to fscache_retrieval_complete() in the error
handling/collision paths of its reader functions.

This can be seen by the following assertion tripping in fscache_put_operation()
whereby the operation being destroyed is still in the in-progress state and has
not been cancelled or completed:

FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:408!
invalid opcode: 0000 [#1] SMP
CPU 2
Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse dcdbas pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
pci_hotplug sg sr_mod]

Pid: 8062, comm: httpd Not tainted 3.1.0-rc8 #1 Dell Inc. PowerEdge 1950/0DT097
RIP: 0010:[<ffffffff81197b24>]  [<ffffffff81197b24>] fscache_put_operation+0x304/0x330
RSP: 0018:ffff880062f739d8  EFLAGS: 00010296
RAX: 0000000000000025 RBX: ffff8800c5122e84 RCX: ffffffff81ddf040
RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
RBP: ffff880062f739f8 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000003 R12: ffff8800c5122e40
R13: ffff880037a2cd20 R14: ffff880087c7a058 R15: ffff880087c7a000
FS:  00007f63dcf636e0(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0c0a91f000 CR3: 0000000062ec2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process httpd (pid: 8062, threadinfo ffff880062f72000, task ffff880087e58000)
Stack:
 ffff880062f73bf8 0000000000000000 ffff880062f73bf8 ffff880037a2cd20
 ffff880062f73a68 ffffffff8119aa7e ffff88006540e000 ffff880062f73ad4
 ffff88008e9a4308 ffff880037a2cd20 ffff880062f73a48 ffff8800c5122e40
Call Trace:
 [<ffffffff8119aa7e>] __fscache_read_or_alloc_pages+0x1fe/0x530
 [<ffffffff81250780>] __nfs_readpages_from_fscache+0x70/0x1c0
 [<ffffffff8123142a>] nfs_readpages+0xca/0x1e0
 [<ffffffff815f3c06>] ? rpc_do_put_task+0x36/0x50
 [<ffffffff8122755b>] ? alloc_nfs_open_context+0x4b/0x110
 [<ffffffff815ecd1a>] ? rpc_call_sync+0x5a/0x70
 [<ffffffff810e7e9a>] __do_page_cache_readahead+0x1ca/0x270
 [<ffffffff810e7f61>] ra_submit+0x21/0x30
 [<ffffffff810e818d>] ondemand_readahead+0x11d/0x250
 [<ffffffff810e83b6>] page_cache_sync_readahead+0x36/0x60
 [<ffffffff810dffa4>] generic_file_aio_read+0x454/0x770
 [<ffffffff81224ce1>] nfs_file_read+0xe1/0x130
 [<ffffffff81121bd9>] do_sync_read+0xd9/0x120
 [<ffffffff8114088f>] ? mntput+0x1f/0x40
 [<ffffffff811238cb>] ? fput+0x1cb/0x260
 [<ffffffff81122938>] vfs_read+0xc8/0x180
 [<ffffffff81122af5>] sys_read+0x55/0x90

Reported-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/cachefiles/rdwr.c |   14 ++++++++++----
 fs/fscache/page.c    |    2 ++
 2 files changed, 12 insertions(+), 4 deletions(-)


diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 637a27d..eb9ab4b 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -361,8 +361,10 @@ out:
 
 read_error:
 	_debug("read error %d", ret);
-	if (ret == -ENOMEM)
+	if (ret == -ENOMEM) {
+		fscache_retrieval_complete(op, 1);
 		goto out;
+	}
 io_error:
 	cachefiles_io_error_obj(object, "Page read error on backing file");
 	fscache_retrieval_complete(op, 1);
@@ -551,6 +553,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 		if (ret < 0) {
 			if (ret == -EEXIST) {
 				page_cache_release(netpage);
+				fscache_retrieval_complete(op, 1);
 				continue;
 			}
 			goto nomem;
@@ -627,6 +630,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 		if (ret < 0) {
 			if (ret == -EEXIST) {
 				page_cache_release(netpage);
+				fscache_retrieval_complete(op, 1);
 				continue;
 			}
 			goto nomem;
@@ -645,9 +649,9 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
 
 		/* the netpage is unlocked and marked up to date here */
 		fscache_end_io(op, netpage, 0);
-		fscache_retrieval_complete(op, 1);
 		page_cache_release(netpage);
 		netpage = NULL;
+		fscache_retrieval_complete(op, 1);
 		continue;
 	}
 
@@ -682,15 +686,17 @@ out:
 nomem:
 	_debug("nomem");
 	ret = -ENOMEM;
-	goto out;
+	goto record_page_complete;
 
 read_error:
 	_debug("read error %d", ret);
 	if (ret == -ENOMEM)
-		goto out;
+		goto record_page_complete;
 io_error:
 	cachefiles_io_error_obj(object, "Page read error on backing file");
 	ret = -ENOBUFS;
+record_page_complete:
+	fscache_retrieval_complete(op, 1);
 	goto out;
 }
 
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index a30c157..00a5ed9 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -329,6 +329,8 @@ check_if_dead:
 		return -ENOBUFS;
 	}
 	if (unlikely(fscache_object_is_dead(object))) {
+		pr_err("%s() = -ENOBUFS [obj dead %d]", __func__, op->op.state);
+		fscache_cancel_op(&op->op);
 		fscache_stat(stat_object_dead);
 		return -ENOBUFS;
 	}

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-30 12:28                 ` David Howells
@ 2011-09-30 18:57                   ` Mark Moseley
  2011-09-30 20:10                   ` David Howells
                                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-09-30 18:57 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: linux-kernel

On Fri, Sep 30, 2011 at 5:28 AM, David Howells <dhowells@redhat.com> wrote:
>
> You'll probably need to add the attached patch also.  Turns out there were
> some bits I'd missed.
>
> David
> ---
> From: David Howells <dhowells@redhat.com>
> Subject: [PATCH] CacheFiles: Add missing retrieval completions
>
> CacheFiles is missing some calls to fscache_retrieval_complete() in the error
> handling/collision paths of its reader functions.
>
> This can be seen by the following assertion tripping in fscache_put_operation()
> whereby the operation being destroyed is still in the in-progress state and has
> not been cancelled or completed:
>
> FS-Cache: Assertion failed
> 3 == 5 is false
> ------------[ cut here ]------------
> kernel BUG at fs/fscache/operation.c:408!
> invalid opcode: 0000 [#1] SMP
> CPU 2
> Modules linked in: xfs ioatdma dca loop joydev evdev
> psmouse dcdbas pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
> pci_hotplug sg sr_mod]
>
> Pid: 8062, comm: httpd Not tainted 3.1.0-rc8 #1 Dell Inc. PowerEdge 1950/0DT097
> RIP: 0010:[<ffffffff81197b24>]  [<ffffffff81197b24>] fscache_put_operation+0x304/0x330
> RSP: 0018:ffff880062f739d8  EFLAGS: 00010296
> RAX: 0000000000000025 RBX: ffff8800c5122e84 RCX: ffffffff81ddf040
> RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
> RBP: ffff880062f739f8 R08: 0000000000000005 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000003 R12: ffff8800c5122e40
> R13: ffff880037a2cd20 R14: ffff880087c7a058 R15: ffff880087c7a000
> FS:  00007f63dcf636e0(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0c0a91f000 CR3: 0000000062ec2000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process httpd (pid: 8062, threadinfo ffff880062f72000, task ffff880087e58000)
> Stack:
>  ffff880062f73bf8 0000000000000000 ffff880062f73bf8 ffff880037a2cd20
>  ffff880062f73a68 ffffffff8119aa7e ffff88006540e000 ffff880062f73ad4
>  ffff88008e9a4308 ffff880037a2cd20 ffff880062f73a48 ffff8800c5122e40
> Call Trace:
>  [<ffffffff8119aa7e>] __fscache_read_or_alloc_pages+0x1fe/0x530
>  [<ffffffff81250780>] __nfs_readpages_from_fscache+0x70/0x1c0
>  [<ffffffff8123142a>] nfs_readpages+0xca/0x1e0
>  [<ffffffff815f3c06>] ? rpc_do_put_task+0x36/0x50
>  [<ffffffff8122755b>] ? alloc_nfs_open_context+0x4b/0x110
>  [<ffffffff815ecd1a>] ? rpc_call_sync+0x5a/0x70
>  [<ffffffff810e7e9a>] __do_page_cache_readahead+0x1ca/0x270
>  [<ffffffff810e7f61>] ra_submit+0x21/0x30
>  [<ffffffff810e818d>] ondemand_readahead+0x11d/0x250
>  [<ffffffff810e83b6>] page_cache_sync_readahead+0x36/0x60
>  [<ffffffff810dffa4>] generic_file_aio_read+0x454/0x770
>  [<ffffffff81224ce1>] nfs_file_read+0xe1/0x130
>  [<ffffffff81121bd9>] do_sync_read+0xd9/0x120
>  [<ffffffff8114088f>] ? mntput+0x1f/0x40
>  [<ffffffff811238cb>] ? fput+0x1cb/0x260
>  [<ffffffff81122938>] vfs_read+0xc8/0x180
>  [<ffffffff81122af5>] sys_read+0x55/0x90
>
> Reported-by: Mark Moseley <moseleymark@gmail.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  fs/cachefiles/rdwr.c |   14 ++++++++++----
>  fs/fscache/page.c    |    2 ++
>  2 files changed, 12 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c


Ok, with that patch, I get this in the logs:


[ 6052.503856] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504366] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504376] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504380] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504384] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504388] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504391] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504395] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.504398] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
...
<several pages of this>
...
[ 6052.512921] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.512926] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.512930] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.641682] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.645533] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 6052.645914] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]
[ 7719.996876]
[ 7719.996883] FS-Cache: Assertion failed
[ 7719.996886] 3 == 5 is false
[ 7719.996906] ------------[ cut here ]------------
[ 7720.006139] kernel BUG at fs/fscache/operation.c:408!
[ 7720.006751] invalid opcode: 0000 [#1] SMP
[ 7720.006751] CPU 7


Console, I got this (gets a little jumbled at the end):

[ 7719.996906] ------------[ cut here ]------------
[ 7720.006139] kernel BUG at fs/fscache/operation.c:408!
[ 7720.006751] invalid opcode: 0000 [#1] SMP
[ 7720.006751] CPU 7
[ 7720.006751] Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse dcdbas serio_raw pcspkr i5000_edac edac_core i5k_amb shpchp
pci_hotplug sg sr_mod]
[ 7720.060050]
[ 7720.060050] Pid: 21299, comm: kworker/u:2 Not tainted 3.1.0-rc8 #1
Dell Inc. PowerEdge 1950/0DT097
[ 7720.060050] RIP: 0010:[<ffffffff81197b24>]  [<ffffffff81197b24>]
fscache_put_operation+0x304/0x330
[ 7720.060050] RSP: 0018:ffff8800c537ddd0  EFLAGS: 00010286
[ 7720.060050] RAX: 0000000000000025 RBX: ffff8801d8e1c7c4 RCX: ffffffff81ddf040
[ 7720.060050] RDX: 00000000ffffffff RSI: 0000000000000086 RDI: ffffffff81ddef30
[ 7720.060050] RBP: ffff8800c537ddf0 R08: 0000000000000006 R09: 0000000000000000
[ 7720.060050] R10: 0000000000000000 R11: 0000000000000003 R12: ffff8801d8e1c780
[ 7720.060050] R13: ffff8801532f5d80 R14: ffffffff81e01780 R15: ffff880226b95405
[ 7720.060050] FS:  0000000000000000(0000) GS:ffff88022fdc0000(0000)
knlGS:0000000000000000
[ 7720.060050] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 7720.060050] CR2: 00007f69c48bf000 CR3: 000000010b736000 CR4: 00000000000006e0
[ 7720.060050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7720.060050] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7720.060050] Process kworker/u:2 (pid: 21299, threadinfo
ffff8800c537c000, task ffff8801a67f8000)
[ 7720.060050] Stack:
[ 7720.060050]  ffff8801d8e1c780 ffff8801d8e1c780 ffff8801d8e1c780
ffff8801532f5d80
[ 7720.060050]  ffff8800c537de10 ffffffff81197b8b ffffffff81e01780
0000000000000000
[ 7720.060050]  ffff8800c537de60 ffffffff810660c4 0000000000000000
ffff880226b95400
[ 7720.060050] Call Trace:
[ 7720.060050]  [<ffffffff81197b8b>] fscache_op_work_func+0x3b/0xd0
[ 7720.060050]  [<ffffffff810660c4>] process_one_work+0x164/0x440
[ 7720.060050]  [<ffffffff81197b50>] ? fscache_put_operation+0x330/0x330
[ 7720.060050]  [<ffffffff8106674b>] worker_thread+0x19b/0x430
[ 7720.060050]  [<ffffffff810665b0>] ? manage_workers+0x210/0x210
[ 7720.060050]  [<ffffffff8106d11e>] kthread+0x9e/0xb0
[ 7720.060050]  [<ffffffff8162d4f4>] kernel_thread_helper+0x4/0x10
[ 7720.060050]  [<ffffffff8162ab8a>] ? retint_restore_args+0xe/0xe
[ 7720.060050]  [<ffffffff8106d080>] ? kthread_worker_fn+0x190/0x190
[ 7720.060050]  [<ffffffff8162d4f0>] ? gs_change+0xb/0xb
[ 7720.060050] Code: e8 59 fa 48 00 48 c7 c7 88 1d a1 81 31 c0 e8 4b
fa 48 00 41 8b 74 24 40 ba 05 00 00 00 48 c7 c7 0f 9c a0 81 31 c0 e8
33 fa 48 00 <0f> 0b
[ 7720.060050] RIP  [<ffffffff81197b24>] fscache_put_operation+0x304/0x330
[ 7720.060050]  RSP <ffff8800c537ddd0>
[ 7719.996906] [ 7720.512787] ---[ end trace d88afc9d74f3048c ]---
------------[ cu[ 7720.526298] Kernel panic - not syncing: Fatal exception
t here ]--------[ 7720.538799] Pid: 21299, comm: kworker/u:2 Tainted:
G      D     3.1.0-rc8 #1
----
    2011 Sep 30[ 7720.555240] Call Trace:
 14:40:56 [ 7720.562582]  [<ffffffff8162742b>] panic+0xbf/0x1eb
t2102 [ 7720.006[ 7720.575051]  [<ffffffff8104b98f>] ? kmsg_dump+0x4f/0x100
751] invalid opc[ 7720.587959]  [<ffffffff81005c08>] oops_end+0xa8/0xf0
ode: 0000 [#1] S[ 7720.600553]  [<ffffffff81005d4b>] die+0x5b/0x90
MP
   [ 7720.612527]  [<ffffffff81003386>] do_trap+0x156/0x180
[ 7720.623060]  [<ffffffff81072cea>] ? atomic_notifier_call_chain+0x1a/0x20
[ 7720.636598]  [<ffffffff81003805>] do_invalid_op+0x95/0xb0
[ 7720.647485]  [<ffffffff81197b24>] ? fscache_put_operation+0x304/0x330
[ 7720.660477]  [<ffffffff812dc58a>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 7720.673634]  [<ffffffff8162abba>] ? restore_args+0x30/0x30
[ 7720.684682]  [<ffffffff8162d375>] invalid_op+0x15/0x20
[ 7720.695008]  [<ffffffff81197b24>] ? fscache_put_operation+0x304/0[
7720.835000] Rebooting in 120 seconds..

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-30 12:28                 ` David Howells
  2011-09-30 18:57                   ` Mark Moseley
@ 2011-09-30 20:10                   ` David Howells
  2011-10-05 13:37                   ` David Howells
                                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: David Howells @ 2011-09-30 20:10 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> [ 6052.503856] [kworke] <== __fscache_write_page() = -ENOBUFS [invalidating]

Nothing to worry about.  I have some kleave() calls to turn into _leave().

> [ 7720.060050]  [<ffffffff81197b24>] fscache_put_operation+0x304/0x330
> [ 7720.060050]  [<ffffffff81197b8b>] fscache_op_work_func+0x3b/0xd0
> [ 7720.060050]  [<ffffffff810660c4>] process_one_work+0x164/0x440
> [ 7720.060050]  [<ffffffff81197b50>] ? fscache_put_operation+0x330/0x330
> [ 7720.060050]  [<ffffffff8106674b>] worker_thread+0x19b/0x430
> [ 7720.060050]  [<ffffffff810665b0>] ? manage_workers+0x210/0x210
> [ 7720.060050]  [<ffffffff8106d11e>] kthread+0x9e/0xb0
> [ 7720.060050]  [<ffffffff8162d4f4>] kernel_thread_helper+0x4/0x10
> [ 7720.060050]  [<ffffffff8162ab8a>] ? retint_restore_args+0xe/0xe
> [ 7720.060050]  [<ffffffff8106d080>] ? kthread_worker_fn+0x190/0x190
> [ 7720.060050]  [<ffffffff8162d4f0>] ? gs_change+0xb/0xb

Hmmm...  That's a new one.  Looks like I've missed another op put.

Thanks for testing!

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-30 12:28                 ` David Howells
  2011-09-30 18:57                   ` Mark Moseley
  2011-09-30 20:10                   ` David Howells
@ 2011-10-05 13:37                   ` David Howells
  2011-10-05 13:49                   ` David Howells
  2011-10-07 10:42                   ` David Howells
  4 siblings, 0 replies; 46+ messages in thread
From: David Howells @ 2011-10-05 13:37 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> [ 7719.996883] FS-Cache: Assertion failed
> [ 7719.996886] 3 == 5 is false
> [ 7719.996906] ------------[ cut here ]------------
> [ 7720.006139] kernel BUG at fs/fscache/operation.c:408!
> [ 7720.006751] invalid opcode: 0000 [#1] SMP
> [ 7720.006751] CPU 7

Ummm... what's at line 408 for you?  It doesn't seem to be the same as what
I've got.

Do you have all my patches applied?

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-30 12:28                 ` David Howells
                                     ` (2 preceding siblings ...)
  2011-10-05 13:37                   ` David Howells
@ 2011-10-05 13:49                   ` David Howells
  2011-10-07 10:42                   ` David Howells
  4 siblings, 0 replies; 46+ messages in thread
From: David Howells @ 2011-10-05 13:49 UTC (permalink / raw)
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> > [ 7719.996883] FS-Cache: Assertion failed
> > [ 7719.996886] 3 == 5 is false
> > [ 7719.996906] ------------[ cut here ]------------
> > [ 7720.006139] kernel BUG at fs/fscache/operation.c:408!
> > [ 7720.006751] invalid opcode: 0000 [#1] SMP
> > [ 7720.006751] CPU 7
> 
> Ummm... what's at line 408 for you?  It doesn't seem to be the same as what
> I've got.
> 
> Do you have all my patches applied?

Never mind.  I've got a patch applied that I haven't sent out yet.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-09-30 12:28                 ` David Howells
                                     ` (3 preceding siblings ...)
  2011-10-05 13:49                   ` David Howells
@ 2011-10-07 10:42                   ` David Howells
  2011-10-08 16:32                     ` Mark Moseley
  2011-10-11 13:07                     ` David Howells
  4 siblings, 2 replies; 46+ messages in thread
From: David Howells @ 2011-10-07 10:42 UTC (permalink / raw)
  To: Linux filesystem caching discussion list; +Cc: dhowells, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> [ 7719.996883] FS-Cache: Assertion failed
> [ 7719.996886] 3 == 5 is false
> [ 7719.996906] ------------[ cut here ]------------
> [ 7720.006139] kernel BUG at fs/fscache/operation.c:408!

Can you add the attached patch?  It will display which operation was being run
just before displaying the above assertion.

David
---
From: David Howells <dhowells@redhat.com>
Subject: [PATCH] FS-Cache: Give operations names for debugging

Give operations names for debugging and print it if we're going to assert.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/fscache/object.c           |    1 +
 fs/fscache/operation.c        |   18 ++++++++++++++++++
 fs/fscache/page.c             |    5 +++++
 include/linux/fscache-cache.h |   14 +++++++++++++-
 4 files changed, 37 insertions(+), 1 deletions(-)


diff --git a/fs/fscache/object.c b/fs/fscache/object.c
index 80b5491..91d998b 100644
--- a/fs/fscache/object.c
+++ b/fs/fscache/object.c
@@ -940,6 +940,7 @@ static void fscache_invalidate_object(struct fscache_object *object)
 	}
 
 	fscache_operation_init(op, object->cache->ops->invalidate_object, NULL);
+	op->name = FSCACHE_OP_INVALIDATE;
 	op->flags = FSCACHE_OP_ASYNC | (1 << FSCACHE_OP_EXCLUSIVE);
 
 	spin_lock(&cookie->lock);
diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
index 2037f03..6bfefee 100644
--- a/fs/fscache/operation.c
+++ b/fs/fscache/operation.c
@@ -20,6 +20,16 @@
 atomic_t fscache_op_debug_id;
 EXPORT_SYMBOL(fscache_op_debug_id);
 
+static const char *const fscache_op_names[FSCACHE_OP__NR] = {
+	[FSCACHE_OP_UNNAMED]		= "Unnamed",
+	[FSCACHE_OP_INVALIDATE]		= "Invalidate",
+	[FSCACHE_OP_ATTR_CHANGED]	= "AttrChanged",
+	[FSCACHE_OP_ALLOC_PAGE]		= "AllocPage",
+	[FSCACHE_OP_READ_OR_ALLOC_PAGE]	= "ReadOrAllocPage",
+	[FSCACHE_OP_READ_OR_ALLOC_PAGES] = "ReadOrAllocPages",
+	[FSCACHE_OP_WRITE]		= "Write",
+};
+
 /**
  * fscache_enqueue_operation - Enqueue an operation for processing
  * @op: The operation to enqueue
@@ -86,6 +96,7 @@ int fscache_submit_exclusive_op(struct fscache_object *object,
 {
 	_enter("{OBJ%x OP%x},", object->debug_id, op->debug_id);
 
+	ASSERTCMP(op->name, >, FSCACHE_OP_UNNAMED);
 	ASSERTCMP(op->state, ==, FSCACHE_OP_ST_INITIALISED);
 	ASSERTCMP(atomic_read(&op->usage), >, 0);
 
@@ -189,6 +200,7 @@ int fscache_submit_op(struct fscache_object *object,
 	_enter("{OBJ%x OP%x},{%u}",
 	       object->debug_id, op->debug_id, atomic_read(&op->usage));
 
+	ASSERTCMP(op->name, >, FSCACHE_OP_UNNAMED);
 	ASSERTCMP(op->state, ==, FSCACHE_OP_ST_INITIALISED);
 	ASSERTCMP(atomic_read(&op->usage), >, 0);
 
@@ -404,6 +416,12 @@ void fscache_put_operation(struct fscache_operation *op)
 		return;
 
 	_debug("PUT OP");
+
+	if (op->state != FSCACHE_OP_ST_COMPLETE &&
+	    op->state != FSCACHE_OP_ST_CANCELLED)
+		printk("FS-Cache: Asserting on %s operation\n",
+		       fscache_op_names[op->name]);
+
 	ASSERTIFCMP(op->state != FSCACHE_OP_ST_COMPLETE,
 		    op->state, ==, FSCACHE_OP_ST_CANCELLED);
 	op->state = FSCACHE_OP_ST_DEAD;
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 00a5ed9..cf6dd34 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -188,6 +188,7 @@ int __fscache_attr_changed(struct fscache_cookie *cookie)
 	}
 
 	fscache_operation_init(op, fscache_attr_changed_op, NULL);
+	op->name = FSCACHE_OP_ATTR_CHANGED;
 	op->flags = FSCACHE_OP_ASYNC | (1 << FSCACHE_OP_EXCLUSIVE);
 
 	spin_lock(&cookie->lock);
@@ -379,6 +380,7 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
 		_leave(" = -ENOMEM");
 		return -ENOMEM;
 	}
+	op->op.name = FSCACHE_OP_READ_OR_ALLOC_PAGE;
 	op->n_pages = 1;
 
 	spin_lock(&cookie->lock);
@@ -505,6 +507,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
 	op = fscache_alloc_retrieval(mapping, end_io_func, context);
 	if (!op)
 		return -ENOMEM;
+	op->op.name = FSCACHE_OP_READ_OR_ALLOC_PAGES;
 	op->n_pages = *nr_pages;
 
 	spin_lock(&cookie->lock);
@@ -635,6 +638,7 @@ int __fscache_alloc_page(struct fscache_cookie *cookie,
 	op = fscache_alloc_retrieval(page->mapping, NULL, NULL);
 	if (!op)
 		return -ENOMEM;
+	op->op.name = FSCACHE_OP_ALLOC_PAGE;
 	op->n_pages = 1;
 
 	spin_lock(&cookie->lock);
@@ -856,6 +860,7 @@ int __fscache_write_page(struct fscache_cookie *cookie,
 
 	fscache_operation_init(&op->op, fscache_write_op,
 			       fscache_release_write_op);
+	op->op.name = FSCACHE_OP_WRITE;
 	op->op.flags = FSCACHE_OP_ASYNC | (1 << FSCACHE_OP_WAITING);
 
 	ret = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 29f552d..fa61436 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -85,6 +85,17 @@ enum fscache_operation_state {
 	FSCACHE_OP_ST_DEAD		/* Op is now dead */
 };
 
+enum fscache_operation_name {
+	FSCACHE_OP_UNNAMED,
+	FSCACHE_OP_INVALIDATE,
+	FSCACHE_OP_ATTR_CHANGED,
+	FSCACHE_OP_ALLOC_PAGE,
+	FSCACHE_OP_READ_OR_ALLOC_PAGE,
+	FSCACHE_OP_READ_OR_ALLOC_PAGES,
+	FSCACHE_OP_WRITE,
+	FSCACHE_OP__NR
+};
+
 struct fscache_operation {
 	struct work_struct	work;		/* record for async ops */
 	struct list_head	pend_link;	/* link in object->pending_ops */
@@ -99,7 +110,8 @@ struct fscache_operation {
 #define FSCACHE_OP_DEC_READ_CNT	6	/* decrement object->n_reads on destruction */
 #define FSCACHE_OP_KEEP_FLAGS	0x0070	/* flags to keep when repurposing an op */
 
-	enum fscache_operation_state state;
+	enum fscache_operation_state state : 8;
+	enum fscache_operation_name name : 8;
 	atomic_t		usage;
 	unsigned		debug_id;	/* debugging ID */
 

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-07 10:42                   ` David Howells
@ 2011-10-08 16:32                     ` Mark Moseley
  2011-10-11 13:07                     ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-08 16:32 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Fri, Oct 7, 2011 at 3:42 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> [ 7719.996883] FS-Cache: Assertion failed
>> [ 7719.996886] 3 == 5 is false
>> [ 7719.996906] ------------[ cut here ]------------
>> [ 7720.006139] kernel BUG at fs/fscache/operation.c:408!
>
> Can you add the attached patch?  It will display which operation was being run
> just before displaying the above assertion.
>
> David
> ---
> From: David Howells <dhowells@redhat.com>
> Subject: [PATCH] FS-Cache: Give operations names for debugging
>
> Give operations names for debugging and print it if we're going to assert.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
>
>  fs/fscache/object.c           |    1 +
>  fs/fscache/operation.c        |   18 ++++++++++++++++++
>  fs/fscache/page.c             |    5 +++++
>  include/linux/fscache-cache.h |   14 +++++++++++++-
>  4 files changed, 37 insertions(+), 1 deletions(-)
>
>
> diff --git a/fs/fscache/object.c b/fs/fscache/object.c
> index 80b5491..91d998b 100644
> --- a/fs/fscache/object.c
> +++ b/fs/fscache/object.c
> @@ -940,6 +940,7 @@ static void fscache_invalidate_object(struct fscache_object *object)
>        }
>
>        fscache_operation_init(op, object->cache->ops->invalidate_object, NULL);
> +       op->name = FSCACHE_OP_INVALIDATE;
>        op->flags = FSCACHE_OP_ASYNC | (1 << FSCACHE_OP_EXCLUSIVE);
>
>        spin_lock(&cookie->lock);
> diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
> index 2037f03..6bfefee 100644
> --- a/fs/fscache/operation.c
> +++ b/fs/fscache/operation.c
> @@ -20,6 +20,16 @@
>  atomic_t fscache_op_debug_id;
>  EXPORT_SYMBOL(fscache_op_debug_id);
>
> +static const char *const fscache_op_names[FSCACHE_OP__NR] = {
> +       [FSCACHE_OP_UNNAMED]            = "Unnamed",
> +       [FSCACHE_OP_INVALIDATE]         = "Invalidate",
> +       [FSCACHE_OP_ATTR_CHANGED]       = "AttrChanged",
> +       [FSCACHE_OP_ALLOC_PAGE]         = "AllocPage",
> +       [FSCACHE_OP_READ_OR_ALLOC_PAGE] = "ReadOrAllocPage",
> +       [FSCACHE_OP_READ_OR_ALLOC_PAGES] = "ReadOrAllocPages",
> +       [FSCACHE_OP_WRITE]              = "Write",
> +};
> +
>  /**
>  * fscache_enqueue_operation - Enqueue an operation for processing
>  * @op: The operation to enqueue
> @@ -86,6 +96,7 @@ int fscache_submit_exclusive_op(struct fscache_object *object,
>  {
>        _enter("{OBJ%x OP%x},", object->debug_id, op->debug_id);
>
> +       ASSERTCMP(op->name, >, FSCACHE_OP_UNNAMED);
>        ASSERTCMP(op->state, ==, FSCACHE_OP_ST_INITIALISED);
>        ASSERTCMP(atomic_read(&op->usage), >, 0);
>
> @@ -189,6 +200,7 @@ int fscache_submit_op(struct fscache_object *object,
>        _enter("{OBJ%x OP%x},{%u}",
>               object->debug_id, op->debug_id, atomic_read(&op->usage));
>
> +       ASSERTCMP(op->name, >, FSCACHE_OP_UNNAMED);
>        ASSERTCMP(op->state, ==, FSCACHE_OP_ST_INITIALISED);
>        ASSERTCMP(atomic_read(&op->usage), >, 0);
>
> @@ -404,6 +416,12 @@ void fscache_put_operation(struct fscache_operation *op)
>                return;
>
>        _debug("PUT OP");
> +
> +       if (op->state != FSCACHE_OP_ST_COMPLETE &&
> +           op->state != FSCACHE_OP_ST_CANCELLED)
> +               printk("FS-Cache: Asserting on %s operation\n",
> +                      fscache_op_names[op->name]);
> +
>        ASSERTIFCMP(op->state != FSCACHE_OP_ST_COMPLETE,
>                    op->state, ==, FSCACHE_OP_ST_CANCELLED);
>        op->state = FSCACHE_OP_ST_DEAD;
> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
> index 00a5ed9..cf6dd34 100644
> --- a/fs/fscache/page.c
> +++ b/fs/fscache/page.c
> @@ -188,6 +188,7 @@ int __fscache_attr_changed(struct fscache_cookie *cookie)
>        }
>
>        fscache_operation_init(op, fscache_attr_changed_op, NULL);
> +       op->name = FSCACHE_OP_ATTR_CHANGED;
>        op->flags = FSCACHE_OP_ASYNC | (1 << FSCACHE_OP_EXCLUSIVE);
>
>        spin_lock(&cookie->lock);
> @@ -379,6 +380,7 @@ int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
>                _leave(" = -ENOMEM");
>                return -ENOMEM;
>        }
> +       op->op.name = FSCACHE_OP_READ_OR_ALLOC_PAGE;
>        op->n_pages = 1;
>
>        spin_lock(&cookie->lock);
> @@ -505,6 +507,7 @@ int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
>        op = fscache_alloc_retrieval(mapping, end_io_func, context);
>        if (!op)
>                return -ENOMEM;
> +       op->op.name = FSCACHE_OP_READ_OR_ALLOC_PAGES;
>        op->n_pages = *nr_pages;
>
>        spin_lock(&cookie->lock);
> @@ -635,6 +638,7 @@ int __fscache_alloc_page(struct fscache_cookie *cookie,
>        op = fscache_alloc_retrieval(page->mapping, NULL, NULL);
>        if (!op)
>                return -ENOMEM;
> +       op->op.name = FSCACHE_OP_ALLOC_PAGE;
>        op->n_pages = 1;
>
>        spin_lock(&cookie->lock);
> @@ -856,6 +860,7 @@ int __fscache_write_page(struct fscache_cookie *cookie,
>
>        fscache_operation_init(&op->op, fscache_write_op,
>                               fscache_release_write_op);
> +       op->op.name = FSCACHE_OP_WRITE;
>        op->op.flags = FSCACHE_OP_ASYNC | (1 << FSCACHE_OP_WAITING);
>
>        ret = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
> diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
> index 29f552d..fa61436 100644
> --- a/include/linux/fscache-cache.h
> +++ b/include/linux/fscache-cache.h
> @@ -85,6 +85,17 @@ enum fscache_operation_state {
>        FSCACHE_OP_ST_DEAD              /* Op is now dead */
>  };
>
> +enum fscache_operation_name {
> +       FSCACHE_OP_UNNAMED,
> +       FSCACHE_OP_INVALIDATE,
> +       FSCACHE_OP_ATTR_CHANGED,
> +       FSCACHE_OP_ALLOC_PAGE,
> +       FSCACHE_OP_READ_OR_ALLOC_PAGE,
> +       FSCACHE_OP_READ_OR_ALLOC_PAGES,
> +       FSCACHE_OP_WRITE,
> +       FSCACHE_OP__NR
> +};
> +
>  struct fscache_operation {
>        struct work_struct      work;           /* record for async ops */
>        struct list_head        pend_link;      /* link in object->pending_ops */
> @@ -99,7 +110,8 @@ struct fscache_operation {
>  #define FSCACHE_OP_DEC_READ_CNT        6       /* decrement object->n_reads on destruction */
>  #define FSCACHE_OP_KEEP_FLAGS  0x0070  /* flags to keep when repurposing an op */
>
> -       enum fscache_operation_state state;
> +       enum fscache_operation_state state : 8;
> +       enum fscache_operation_name name : 8;
>        atomic_t                usage;
>        unsigned                debug_id;       /* debugging ID */
>
> --


Here's the latest. I'd emptied out the cache first, so it stayed up
for quite a while.

Console:

[37818.485559] kernel BUG at fs/fscache/object-list.c:64!
[37818.486239] invalid opcode: 0000 [#1] SMP
[37818.486239] CPU 4
[37818.486239] Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse dcdbas pcspkr i5000_edac serio_raw edac_core shpchp i5k_amb
pci_hotplug sg sr_mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last
unloaded: scsi_wait_scan]
[37818.486239]
[37818.486239] Pid: 29993, comm: httpd Not tainted 3.1.0-rc8 #1 Dell
Inc. PowerEdge 1950/0DT097
[37818.486239] RIP: 0010:[<ffffffff8119c548>]  [<ffffffff8119c548>]
fscache_objlist_add+0xa8/0xb0
[37818.486239] RSP: 0018:ffff88002233bb28  EFLAGS: 00010246
[37818.486239] RAX: ffff880040861380 RBX: ffff880040861380 RCX: ffff880040860fd0
[37818.486239] RDX: ffff880040861450 RSI: ffff880223eae000 RDI: ffffffff81c94b90
[37818.486239] RBP: ffff88002233bb38 R08: ffff880010e74680 R09: 0000000000000069
[37818.486239] R10: ffff880010e746e2 R11: ffff880010e746e1 R12: ffff8800408613c8
[37818.486239] R13: ffff88021ded9000 R14: ffff88021a8682d0 R15: ffff880040861380
[37818.486239] FS:  00007fbabf4e36e0(0000) GS:ffff88022fd00000(0000)
knlGS:0000000000000000
[37818.486239] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37818.486239] CR2: ffffffffff600400 CR3: 000000012e062000 CR4: 00000000000006e0
[37818.486239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37818.486239] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[37818.486239] Process httpd (pid: 29993, threadinfo ffff88002233a000,
task ffff8800c70944a0)
[37818.486239] Stack:
[37818.486239]  ffff88021a8682d0 0000000000000000 ffff88002233bb88
ffffffff81194bbe
[37818.486239]  ffff88002233bbc8 ffff88021a8682d8 ffff88021ded9000
ffff88021ded9000
[37818.486239]  ffff880105ad18c0 ffff88021ded9000 ffff880105ad18c0
ffff880040861200
[37818.486239] Call Trace:
[37818.486239]  [<ffffffff81194bbe>] fscache_alloc_object+0x20e/0x460
[37818.486239]  [<ffffffff81194ac4>] fscache_alloc_object+0x114/0x460
[37818.486239]  [<ffffffff81194f6c>] __fscache_acquire_cookie+0x15c/0x610
[37818.486239]  [<ffffffff81224800>] ? nfs_file_release+0x80/0x80
[37818.486239]  [<ffffffff812510dc>] nfs_fscache_set_inode_cookie+0x10c/0x170
[37818.486239]  [<ffffffff812277a5>] nfs_open+0x95/0xa0
[37818.486239]  [<ffffffff81224857>] nfs_file_open+0x57/0x90
[37818.486239]  [<ffffffff8112040a>] __dentry_open+0x17a/0x310
[37818.486239]  [<ffffffff8112069b>] nameidata_to_filp+0x7b/0x80
[37818.486239]  [<ffffffff81130895>] do_last+0x305/0x8f0
[37818.486239]  [<ffffffff81130f54>] path_openat+0xd4/0x3f0
[37818.486239]  [<ffffffff81131398>] do_filp_open+0x48/0xa0
[37818.486239]  [<ffffffff8113d912>] ? alloc_fd+0x52/0x150
[37818.486239]  [<ffffffff81120052>] do_sys_open+0x152/0x1e0
[37818.486239]  [<ffffffff81120120>] sys_open+0x20/0x30
[37818.486239]  [<ffffffff8162b46b>] system_call_fastpath+0x16/0x1b
[37818.486239] Code: 00 00 00 00 00 00 00 48 c7 c6 78 7e e3 81 48 89
38 e8 8d a4 13 00 f0 81 05 52 86 af 00 00 00 10 00 48 83 c4 08 5b c9
c3 0f 1f 00 <0f> 0b eb fe 90 90 90 90 55 48 89 e5 66 66 66 66 90 48 8b
87 b0
[37818.486239] RIP  [<ffffffff8119c548>] fscache_objlist_add+0xa8/0xb0
[37818.486239]  RSP <ffff88002233bb28>
[37819.051624] ---[ end trace 6a206d80f756f517 ]---

(Gets a bit garbled here)

[37819.062840] Kernel panic - not syncing: Fatal exception
[37819.073661] Pid: 29993, comm: httpd Tainted: G      D     3.1.0-rc8 #1
[37819.087096] Call Trace:
[37819.092154]  [<ffffffff816274eb>] panic+0xbf/0x1eb
2011 Oct  7 23:4[37819.103855]  [<ffffffff8104b98f>] ? kmsg_dump+0x4f/0x100
3:13 boscust2102[37819.115413]  [<ffffffff81005c08>] oops_end+0xa8/0xf0
 [37818.476323] [37819.128645]  [<ffffffff81005d4b>] die+0x5b/0x90
------------[ cu[37819.140099]  [<ffffffff81003386>] do_trap+0x156/0x180
t here ]--------[37819.152708]  [<ffffffff81072cea>] ?
atomic_notifier_call_chain+0x1a/0x20
----
    2011 Oct  7[37819.169172]  [<ffffffff81003805>] do_invalid_op+0x95/0xb0
 23:43:13 boscus[37819.182466]  [<ffffffff8119c548>] ?
fscache_objlist_add+0xa8/0xb0
t2102 [37818.486[37819.197295]  [<ffffffff812dc64a>] ?
trace_hardirqs_off_thunk+0x3a/0x6c
239] invalid opc[37819.213052]  [<ffffffff8162ac7a>] ? restore_args+0x30/0x30
ode: 0000 [#1] S[37819.226799]  [<ffffffff8162d435>] invalid_op+0x15/0x20
MP
   [37819.239844]  [<ffffffff8119c548>][37819.458030]


>From logs:


kernel: [10612.850523] [httpd ] <== __fscache_read_or_alloc_pages() =
-ENOBUFS [invalidating]
kernel: [11884.736750] [httpd ] <== __fscache_read_or_alloc_pages() =
-ENOBUFS [invalidating]
kernel: [18237.236009] [httpd ] <== __fscache_read_or_alloc_pages() =
-ENOBUFS [invalidating]
kernel: [33519.922973] [httpd ] <== __fscache_read_or_alloc_pages() =
-ENOBUFS [invalidating]
kernel: [36447.036182] [httpd ] <== __fscache_read_or_alloc_pages() =
-ENOBUFS [invalidating]
kernel: [37818.476323] ------------[ cut here ]------------
kernel: [37818.485559] kernel BUG at fs/fscache/object-list.c:64!
kernel: [37818.486239] invalid opcode: 0000 [#1] SMP
kernel: [37818.486239] CPU 4
kernel: [37818.486239] Modules linked in: xfs ioatdma dca loop joydev
evdev psmouse dcdbas pcspkr i5000_edac serio_raw edac_core
g sr_mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last unloaded:
scsi_wait_scan]
kernel: [37818.486239]
kernel: [37818.486239] Pid: 29993, comm: httpd Not tainted 3.1.0-rc8
#1 Dell Inc. PowerEdge 1950/0DT097
kernel: [37818.486239] RIP: 0010:[<ffffffff8119c548>]
[<ffffffff8119c548>] fscache_objlist_add+0xa8/0xb0
kernel: [37818.486239] RSP: 0018:ffff88002233bb28  EFLAGS: 00010246
kernel: [37818.486239] RAX: ffff880040861380 RBX: ffff880040861380
RCX: ffff880040860fd0
kernel: [37818.486239] RDX: ffff880040861450 RSI: ffff880223eae000
RDI: ffffffff81c94b90
kernel: [37818.486239] RBP: ffff88002233bb38 R08: ffff880010e74680
R09: 0000000000000069
kernel: [37818.486239] R10: ffff880010e746e2 R11: ffff880010e746e1
R12: ffff8800408613c8
kernel: [37818.486239] R13: ffff88021ded9000 R14: ffff88021a8682d0
R15: ffff880040861380

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-07 10:42                   ` David Howells
  2011-10-08 16:32                     ` Mark Moseley
@ 2011-10-11 13:07                     ` David Howells
  2011-10-11 16:27                       ` Mark Moseley
                                         ` (2 more replies)
  1 sibling, 3 replies; 46+ messages in thread
From: David Howells @ 2011-10-11 13:07 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> [37818.485559] kernel BUG at fs/fscache/object-list.c:64!

Hmmm...  An object is being added to the object list twice.  It may indicate a
free before an object is finished with.  Do you have slab debugging turned on?

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-11 13:07                     ` David Howells
@ 2011-10-11 16:27                       ` Mark Moseley
  2011-10-12  9:26                       ` David Howells
  2011-10-12 10:05                       ` David Howells
  2 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-11 16:27 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Tue, Oct 11, 2011 at 6:07 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> [37818.485559] kernel BUG at fs/fscache/object-list.c:64!
>
> Hmmm...  An object is being added to the object list twice.  It may indicate a
> free before an object is finished with.  Do you have slab debugging turned on?
>
> David
>

It's compiled in but not currently enabled. Should I do so? And
anything I should be looking for in it?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-11 13:07                     ` David Howells
  2011-10-11 16:27                       ` Mark Moseley
@ 2011-10-12  9:26                       ` David Howells
  2011-10-12 10:05                       ` David Howells
  2 siblings, 0 replies; 46+ messages in thread
From: David Howells @ 2011-10-12  9:26 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> >> [37818.485559] kernel BUG at fs/fscache/object-list.c:64!
> >
> > Hmmm...  An object is being added to the object list twice.  It may
> > indicate a free before an object is finished with.  Do you have slab
> > debugging turned on?
>
> It's compiled in but not currently enabled. Should I do so?

It can't be turned on or off dynamically.  If CONFIG_DEBUG_SLAB is enabled, then
it's switched on.

> And anything I should be looking for in it?

The BUG that you encountered would suggest a piece of memory is still in use as
far as parts of fscache is concerned - but it's just been produced by the slab
allocator again.  This is the sort of thing the slab debugger might catch.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-11 13:07                     ` David Howells
  2011-10-11 16:27                       ` Mark Moseley
  2011-10-12  9:26                       ` David Howells
@ 2011-10-12 10:05                       ` David Howells
  2011-10-12 18:10                         ` Mark Moseley
  2 siblings, 1 reply; 46+ messages in thread
From: David Howells @ 2011-10-12 10:05 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> > Hmmm...  An object is being added to the object list twice.  It may
> > indicate a free before an object is finished with.  Do you have slab
> > debugging turned on?
> 
> It's compiled in but not currently enabled. Should I do so? And
> anything I should be looking for in it?

This raises another question: Which of CONFIG_SLAB/SLOB/SLUB are you actually
using?

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-12 10:05                       ` David Howells
@ 2011-10-12 18:10                         ` Mark Moseley
  2011-10-12 23:38                           ` Mark Moseley
  2011-10-13 15:21                           ` David Howells
  0 siblings, 2 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-12 18:10 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Wed, Oct 12, 2011 at 3:05 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> > Hmmm...  An object is being added to the object list twice.  It may
>> > indicate a free before an object is finished with.  Do you have slab
>> > debugging turned on?
>>
>> It's compiled in but not currently enabled. Should I do so? And
>> anything I should be looking for in it?
>
> This raises another question: Which of CONFIG_SLAB/SLOB/SLUB are you actually
> using?
>
> David
>

I had SLUB. I have CONFIG_SLUB_DEBUG but not CONFIG_SLUB_DEBUG_ON.

I've recompiled with SLAB and CONFIG_DEBUG_SLAB enabled though:

CONFIG_SLAB=y
CONFIG_SLABINFO=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y

Any flags anywhere that I need to set?

I've also cleared out the fscache cache. When I initially booted into
the 3.1.0-rc8 with SLAB, I was already at the cachefilesd.conf limit,
space-wise, and it would crash almost immediately (basically as soon
as load balancers started directing traffic to it again). I didn't
have console, so I don't have the full traceback for these ones, but a
little of the top actually made it to the logs:


[  117.654487] FS-Cache: Cache "mycache" added (type cachefiles)
[  117.654494] CacheFiles: File cache on sdb6 registered
[  184.943653] ------------[ cut here ]------------
[  184.952933] kernel BUG at fs/fscache/object-list.c:83!
[  184.960676] invalid opcode: 0000 [#1] SMP
[  184.960676] CPU 1
[  184.960676] Modules linked in: xfs ioatdma dca loop joydev evdev
dcdbas psmouse serio_raw pcspkr i5000_edac edac_core i5k_amb shpchp pc
i_hotplug sg sr_mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last
unloaded: scsi_wait_scan]
[  184.960676]
[  184.960676] Pid: 638, comm: kworker/u:2 Not tainted 3.1.0-rc8 #1
Dell Inc. PowerEdge 1950/0DT097
[  184.960676] RIP: 0010:[<ffffffff8119d15c>]  [<ffffffff8119d15c>]
fscache_object_destroy+0x4c/0x50
[  184.960676] RSP: 0000:ffff880224cebd90  EFLAGS: 00010246
[  184.960676] RAX: 0000000000000011 RBX: ffff88021920a5e8 RCX: ffff8801d6b92a38
[  184.960676] RDX: 0000000000000010 RSI: 000000000000006b RDI: ffffffff81c94b10

This was the 2nd one, after I'd panicked it starting up cachefilesd a
couple of minutes earlier (which was also "kernel BUG at
fs/fscache/object-list.c:83"), so I figured I should clear the cache.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-12 18:10                         ` Mark Moseley
@ 2011-10-12 23:38                           ` Mark Moseley
  2011-10-13 15:21                           ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-12 23:38 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Wed, Oct 12, 2011 at 11:10 AM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Wed, Oct 12, 2011 at 3:05 AM, David Howells <dhowells@redhat.com> wrote:
>> Mark Moseley <moseleymark@gmail.com> wrote:
>>
>>> > Hmmm...  An object is being added to the object list twice.  It may
>>> > indicate a free before an object is finished with.  Do you have slab
>>> > debugging turned on?
>>>
>>> It's compiled in but not currently enabled. Should I do so? And
>>> anything I should be looking for in it?
>>
>> This raises another question: Which of CONFIG_SLAB/SLOB/SLUB are you actually
>> using?
>>
>> David
>>
>
> I had SLUB. I have CONFIG_SLUB_DEBUG but not CONFIG_SLUB_DEBUG_ON.
>
> I've recompiled with SLAB and CONFIG_DEBUG_SLAB enabled though:
>
> CONFIG_SLAB=y
> CONFIG_SLABINFO=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_DEBUG_SLAB_LEAK=y
>
> Any flags anywhere that I need to set?
>
> I've also cleared out the fscache cache. When I initially booted into
> the 3.1.0-rc8 with SLAB, I was already at the cachefilesd.conf limit,
> space-wise, and it would crash almost immediately (basically as soon
> as load balancers started directing traffic to it again). I didn't
> have console, so I don't have the full traceback for these ones, but a
> little of the top actually made it to the logs:
>
>
> [  117.654487] FS-Cache: Cache "mycache" added (type cachefiles)
> [  117.654494] CacheFiles: File cache on sdb6 registered
> [  184.943653] ------------[ cut here ]------------
> [  184.952933] kernel BUG at fs/fscache/object-list.c:83!
> [  184.960676] invalid opcode: 0000 [#1] SMP
> [  184.960676] CPU 1
> [  184.960676] Modules linked in: xfs ioatdma dca loop joydev evdev
> dcdbas psmouse serio_raw pcspkr i5000_edac edac_core i5k_amb shpchp pc
> i_hotplug sg sr_mod cdrom ehci_hcd uhci_hcd sd_mod crc_t10dif [last
> unloaded: scsi_wait_scan]
> [  184.960676]
> [  184.960676] Pid: 638, comm: kworker/u:2 Not tainted 3.1.0-rc8 #1
> Dell Inc. PowerEdge 1950/0DT097
> [  184.960676] RIP: 0010:[<ffffffff8119d15c>]  [<ffffffff8119d15c>]
> fscache_object_destroy+0x4c/0x50
> [  184.960676] RSP: 0000:ffff880224cebd90  EFLAGS: 00010246
> [  184.960676] RAX: 0000000000000011 RBX: ffff88021920a5e8 RCX: ffff8801d6b92a38
> [  184.960676] RDX: 0000000000000010 RSI: 000000000000006b RDI: ffffffff81c94b10
>
> This was the 2nd one, after I'd panicked it starting up cachefilesd a
> couple of minutes earlier (which was also "kernel BUG at
> fs/fscache/object-list.c:83"), so I figured I should clear the cache.
>


So on a cleared cache with SLAB, it took a while but this finally came
up. One interesting thing is that at some point, it logged this:

[13461.605871] [httpd ] <== __fscache_read_or_alloc_pages() = -ENOBUFS
[invalidating]

It was a while from when it logged that until when I happened to check
on the box again, but when I did (shortly before this traceback),
despite constant NFS activity, nothing in the fscache cache was
getting written out (i.e. the used bytes on the partition stopped
changing), and without any messages about withdrawing the cache or
anythin. The odds that everything that box was touching was in the
cache are incredibly slim, since it's web hosting. cachefilesd was
chugging along doing cull scans, so it didn't seem to be dead, but I
decided to restart it. The following traceback popped up within a
couple of minutes of cachefilesd restarting.


[20839.802118] kernel BUG at fs/fscache/object-list.c:83!
[20839.802733] invalid opcode: 0000 [#1] SMP
[20839.802733] CPU 0
[20839.802733] Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse i5000_edac serio_raw pcspkr dcdbas shpchp edac_core
pci_hotplug i5k_amb sg sr_mod cdrom ehci_hcd ]
[20839.802733]
[20839.802733] Pid: 21371, comm: kworker/u:2 Not tainted 3.1.0-rc8 #1
Dell Inc. PowerEdge 1950/0DT097
[20839.802733] RIP: 0010:[<ffffffff8119d15c>]  [<ffffffff8119d15c>]
fscache_object_destroy+0x4c/0x50
[20839.802733] RSP: 0018:ffff8800a0d3bd90  EFLAGS: 00010246
[20839.802733] RAX: 0000000000000000 RBX: ffff880137e39a48 RCX: 00000000ffffffff
[20839.802733] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffff81c94b10
[20839.802733] RBP: ffff8800a0d3bda0 R08: ffff88022fc11640 R09: 0000000000000000
[20839.802733] R10: 0000000000000400 R11: 0000000000000000 R12: ffff880137e39a48
[20839.802733] R13: 0000000000000001 R14: ffff880137e39a48 R15: ffff880137e39a68
[20839.802733] FS:  0000000000000000(0000) GS:ffff88022fc00000(0000)
knlGS:0000000000000000
[20839.802733] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[20839.802733] CR2: ffffffffff600400 CR3: 00000001d5845000 CR4: 00000000000006f0
[20839.802733] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[20839.802733] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[20839.802733] Process kworker/u:2 (pid: 21371, threadinfo
ffff8800a0d3a000, task ffff88006ba101c0)
[20839.802733] Stack:
[20839.802733]  ffff8800a0d3bdc0 ffff88007198b720 ffff8800a0d3bdc0
ffffffff81287c85
[20839.802733]  ffff8800578f4c18 ffff880137e39ac8 ffff8800a0d3be10
ffffffff811974fd
[20839.802733]  ffff880137e39a80 ffff880010486178 ffff8800578f4c20
0000000000000000
[20839.802733] Call Trace:
[20839.802733]  [<ffffffff81287c85>] cachefiles_put_object+0xf5/0x360
[20839.802733]  [<ffffffff811974fd>] fscache_object_work_func+0x1cd/0xe90
[20839.802733]  [<ffffffff810661d4>] process_one_work+0x164/0x440
[20839.802733]  [<ffffffff81197330>] ? fscache_enqueue_dependents+0x100/0x100
[20839.802733]  [<ffffffff810669ec>] worker_thread+0x32c/0x430
[20839.802733]  [<ffffffff810666c0>] ? manage_workers+0x210/0x210
[20839.802733]  [<ffffffff8106d27e>] kthread+0x9e/0xb0
[20839.802733]  [<ffffffff8163f7b4>] kernel_thread_helper+0x4/0x10
[20839.802733]  [<ffffffff8163ce4a>] ? retint_restore_args+0xe/0xe
[20839.802733]  [<ffffffff8106d1e0>] ? kthread_worker_fn+0x190/0x190
[20839.802733]  [<ffffffff8163f7b0>] ? gs_change+0xb/0xb
[20839.802733] Code: 6c cb 00 00 74 25 48 8d bb d0 00 00 00 48 c7 c6
f8 3d e5 81 e8 06 ac 13 00 f0 81 05 bb 79 af 00 00 00 10 00 48 83 c4
08 5b c9 c3 <0f> 0b eb fe 55 48 89
[20839.802733] RIP  [<ffffffff8119d15c>] fscache_object_destroy+0x4c/0x50
[20839.802733]  RSP <ffff8800a0d3bd90>
[20840.320257] ---[ end trace fdcbf6423d22291c ]---

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-12 18:10                         ` Mark Moseley
  2011-10-12 23:38                           ` Mark Moseley
@ 2011-10-13 15:21                           ` David Howells
  2011-10-13 20:48                             ` Mark Moseley
  2011-10-14  9:22                             ` David Howells
  1 sibling, 2 replies; 46+ messages in thread
From: David Howells @ 2011-10-13 15:21 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> So on a cleared cache with SLAB, it took a while but this finally came
> up. One interesting thing is that at some point, it logged this:
> 
> [13461.605871] [httpd ] <== __fscache_read_or_alloc_pages() = -ENOBUFS
> [invalidating]

That's okay.  Basically, a read-from-cache operation was rejected because the
cache object was in the early phase of being invalidated.  I kept it simple
here - the read might complete next time it is tried, but it's just a cache so
that shouldn't matter.

> It was a while from when it logged that until when I happened to check
> on the box again, but when I did (shortly before this traceback),
> despite constant NFS activity, nothing in the fscache cache was
> getting written out (i.e. the used bytes on the partition stopped
> changing), and without any messages about withdrawing the cache or
> anythin.

Did you look at /proc/fs/fscache/stats at all?

> [20839.802118] kernel BUG at fs/fscache/object-list.c:83!
> [20839.802733] invalid opcode: 0000 [#1] SMP

That fits with the previous BUG elsewhere in object-list.c.  It sounds like
there's a refcounting problem somewhere.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-13 15:21                           ` David Howells
@ 2011-10-13 20:48                             ` Mark Moseley
  2011-10-14  9:22                             ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-13 20:48 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Oct 13, 2011 at 8:21 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> So on a cleared cache with SLAB, it took a while but this finally came
>> up. One interesting thing is that at some point, it logged this:
>>
>> [13461.605871] [httpd ] <== __fscache_read_or_alloc_pages() = -ENOBUFS
>> [invalidating]
>
> That's okay.  Basically, a read-from-cache operation was rejected because the
> cache object was in the early phase of being invalidated.  I kept it simple
> here - the read might complete next time it is tried, but it's just a cache so
> that shouldn't matter.

Ok, noted


>> It was a while from when it logged that until when I happened to check
>> on the box again, but when I did (shortly before this traceback),
>> despite constant NFS activity, nothing in the fscache cache was
>> getting written out (i.e. the used bytes on the partition stopped
>> changing), and without any messages about withdrawing the cache or
>> anythin.
>
> Did you look at /proc/fs/fscache/stats at all?

I didn't but I can repeat it. Which of the stats in
/proc/fs/fscache/stats would be best to track?


>> [20839.802118] kernel BUG at fs/fscache/object-list.c:83!
>> [20839.802733] invalid opcode: 0000 [#1] SMP
>
> That fits with the previous BUG elsewhere in object-list.c.  It sounds like
> there's a refcounting problem somewhere.

Any sys or proc settings I should turn on to track that?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-13 15:21                           ` David Howells
  2011-10-13 20:48                             ` Mark Moseley
@ 2011-10-14  9:22                             ` David Howells
  2011-10-14 23:25                               ` Mark Moseley
                                                 ` (2 more replies)
  1 sibling, 3 replies; 46+ messages in thread
From: David Howells @ 2011-10-14  9:22 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> > Did you look at /proc/fs/fscache/stats at all?
> 
> I didn't but I can repeat it. Which of the stats in
> /proc/fs/fscache/stats would be best to track?

If you could get two snapshots a couple of minutes apart, that'd be useful.
What I'm interested in is what stops changing and anything in the CacheOp list
at the bottom that becomes wedged on a non-zero value.

> >> [20839.802118] kernel BUG at fs/fscache/object-list.c:83!
> >> [20839.802733] invalid opcode: 0000 [#1] SMP
> >
> > That fits with the previous BUG elsewhere in object-list.c.  It sounds like
> > there's a refcounting problem somewhere.
> 
> Any sys or proc settings I should turn on to track that?

Not really.  However, if you could apply the attached patch, it will move the
object list handling to next to where the object allocation and freeing is
done.  I'm curious to see if this makes a difference.

The 'object list' is an RB tree keyed on the address of an object in RAM - so
if an object is already there it must have been double-added somehow or must
not have been removed.

David
---

 fs/cachefiles/interface.c     |    1 +
 fs/fscache/cache.c            |    1 -
 fs/fscache/cookie.c           |    1 -
 fs/fscache/object-list.c      |    1 +
 include/linux/fscache-cache.h |   19 +++++++++++--------
 5 files changed, 13 insertions(+), 10 deletions(-)


diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index ef5c02d..3dcecdf 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -104,6 +104,7 @@ nomem_key:
 	kfree(buffer);
 nomem_buffer:
 	BUG_ON(test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
+	fscache_object_destroy(&object->fscache);
 	kmem_cache_free(cachefiles_object_jar, object);
 	fscache_object_destroyed(&cache->cache);
 nomem_object:
diff --git a/fs/fscache/cache.c b/fs/fscache/cache.c
index b52aed1..98bca68 100644
--- a/fs/fscache/cache.c
+++ b/fs/fscache/cache.c
@@ -263,7 +263,6 @@ int fscache_add_cache(struct fscache_cache *cache,
 	spin_lock(&cache->object_list_lock);
 	list_add_tail(&ifsdef->cache_link, &cache->object_list);
 	spin_unlock(&cache->object_list_lock);
-	fscache_objlist_add(ifsdef);
 
 	/* add the cache's netfs definition index object to the top level index
 	 * cookie as a known backing object */
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index 8dcb114..47d8cde 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -360,7 +360,6 @@ static int fscache_attach_object(struct fscache_cookie *cookie,
 	atomic_inc(&cookie->usage);
 	hlist_add_head(&object->cookie_link, &cookie->backing_objects);
 
-	fscache_objlist_add(object);
 	ret = 0;
 
 cant_attach_object:
diff --git a/fs/fscache/object-list.c b/fs/fscache/object-list.c
index f27c89d..f8fbb32 100644
--- a/fs/fscache/object-list.c
+++ b/fs/fscache/object-list.c
@@ -69,6 +69,7 @@ void fscache_objlist_add(struct fscache_object *obj)
 
 	write_unlock(&fscache_object_list_lock);
 }
+EXPORT_SYMBOL(fscache_objlist_add);
 
 /**
  * fscache_object_destroy - Note that a cache object is about to be destroyed
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 633b65d..f657c0a 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -440,6 +440,14 @@ extern const char *fscache_object_states[];
 
 extern void fscache_object_work_func(struct work_struct *work);
 
+#ifdef CONFIG_FSCACHE_OBJECT_LIST
+extern void fscache_objlist_add(struct fscache_object *obj);
+extern void fscache_object_destroy(struct fscache_object *object);
+#else
+#define fscache_object_destroy(object) do {} while(0)
+#define fscache_objlist_add(object) do {} while(0)
+#endif
+
 /**
  * fscache_object_init - Initialise a cache object description
  * @object: Object description
@@ -454,8 +462,6 @@ void fscache_object_init(struct fscache_object *object,
 			 struct fscache_cookie *cookie,
 			 struct fscache_cache *cache)
 {
-	atomic_inc(&cache->object_count);
-
 	object->state = FSCACHE_OBJECT_INIT;
 	spin_lock_init(&object->lock);
 	INIT_LIST_HEAD(&object->cache_link);
@@ -473,17 +479,14 @@ void fscache_object_init(struct fscache_object *object,
 	object->cache = cache;
 	object->cookie = cookie;
 	object->parent = NULL;
+
+	atomic_inc(&cache->object_count);
+	fscache_objlist_add(object);
 }
 
 extern void fscache_object_lookup_negative(struct fscache_object *object);
 extern void fscache_obtained_object(struct fscache_object *object);
 
-#ifdef CONFIG_FSCACHE_OBJECT_LIST
-extern void fscache_object_destroy(struct fscache_object *object);
-#else
-#define fscache_object_destroy(object) do {} while(0)
-#endif
-
 /**
  * fscache_object_destroyed - Note destruction of an object in a cache
  * @cache: The cache from which the object came


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-14  9:22                             ` David Howells
@ 2011-10-14 23:25                               ` Mark Moseley
  2011-10-17 10:39                               ` David Howells
  2011-10-19 12:25                               ` David Howells
  2 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-14 23:25 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 9043 bytes --]

On Fri, Oct 14, 2011 at 2:22 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> > Did you look at /proc/fs/fscache/stats at all?
>>
>> I didn't but I can repeat it. Which of the stats in
>> /proc/fs/fscache/stats would be best to track?
>
> If you could get two snapshots a couple of minutes apart, that'd be useful.
> What I'm interested in is what stops changing and anything in the CacheOp list
> at the bottom that becomes wedged on a non-zero value.

Patch is applied. I'm attaching a file with stats and a df of the
fscache partition. I cleared the cache and waited till it got
reasonably full before starting this capture. No crash yet, btw.

You can see that for the past couple of hours the byte counts on the
partition have only fluctuated a handful of Kb.

Incidentally, my cachefilesd.conf (I don't think I've sent it before
in this thread):

# cat /etc/cachefilesd.conf
dir /var/cache/fscache
tag mycache
brun  40%
bcull 30%
bstop 20%
frun  10%
fcull 7%
fstop 3%
culltable 20


# cachefilesd -v
cachefilesd version 0.10.1


Presumably it gets to bcull and stops storing but nothing's getting
pruned. I can see cachefilesd is in constant activity right now.
Looking at strace, maybe it can't find anything to prune because it
thinks it's all active. I'm seeing a constant loop that looks similar
to this:

read(3, "cull=1 frun=2d82a fcull=1fdb7 fst"..., 4096) = 78
fchdir(11)                              = 0
newfstatat(11, "@9c", {st_mode=02, st_size=17592186044416, ...}, 0) = 0
read(3, "cull=1 frun=2d82a fcull=1fdb7 fst"..., 4096) = 78
openat(11, "@9c", O_RDONLY|O_DIRECTORY) = 12
fstat(12, {st_mode=S_IFDIR, st_size=4096, ...}) = 0
fcntl(12, F_GETFL)                      = 0x18000 (flags
O_RDONLY|O_LARGEFILE|O_DIRECTORY)
fcntl(12, F_SETFD, FD_CLOEXEC)          = 0
fchdir(12)                              = 0
getdents(12, /* 23 entries */, 4096)    = 1896
newfstatat(12, "EI0001000Pgb0020000gvF0l0uU2L1QQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=5566277615616, ...}, 0) = 0
write(3, "inuse EI0001000Pgb0020000gvF0l0uU"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb0020000gUSTv0lC2p9QQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=87170656239616, ...}, 0) = 0
write(3, "inuse EI0001000Pgb0020000gUSTv0lC"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb00200000PHNi0Q1O11QQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=24099061497856, ...}, 0) = 0
write(3, "inuse EI0001000Pgb00200000PHNi0Q1"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb0020000Mfm2f05I6O3QQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=36807869726720, ...}, 0) = 0
write(3, "inuse EI0001000Pgb0020000Mfm2f05I"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb0020000MoS5j0667-2QQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=10642928959488, ...}, 0) = 0
write(3, "inuse EI0001000Pgb0020000MoS5j066"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb0020000gaRds0iBnFKQQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=7314329305088, ...}, 0) = 0
write(3, "inuse EI0001000Pgb0020000gaRds0iB"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb00200000794b0y2CYSQQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=22020297326592, ...}, 0) = 0
write(3, "inuse EI0001000Pgb00200000794b0y2"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb0020000MGKZd0sai09RQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=210453397504, ...}, 0) = 0
write(3, "inuse EI0001000Pgb0020000MGKZd0sa"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb00200000G07d0bH5AiQQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=16892106375168, ...}, 0) = 0
write(3, "inuse EI0001000Pgb00200000G07d0bH"..., 71) = -1 EBUSY
(Device or resource busy)
newfstatat(12, "EI0001000Pgb00200000ytn40zr8X1QQG0080000001000IVThl4g000000000000",
{st_mode=01, st_size=129596343189504, ...}, 0) = 0




>> >> [20839.802118] kernel BUG at fs/fscache/object-list.c:83!
>> >> [20839.802733] invalid opcode: 0000 [#1] SMP
>> >
>> > That fits with the previous BUG elsewhere in object-list.c.  It sounds like
>> > there's a refcounting problem somewhere.
>>
>> Any sys or proc settings I should turn on to track that?
>
> Not really.  However, if you could apply the attached patch, it will move the
> object list handling to next to where the object allocation and freeing is
> done.  I'm curious to see if this makes a difference.
>
> The 'object list' is an RB tree keyed on the address of an object in RAM - so
> if an object is already there it must have been double-added somehow or must
> not have been removed.
>
> David
> ---
>
>  fs/cachefiles/interface.c     |    1 +
>  fs/fscache/cache.c            |    1 -
>  fs/fscache/cookie.c           |    1 -
>  fs/fscache/object-list.c      |    1 +
>  include/linux/fscache-cache.h |   19 +++++++++++--------
>  5 files changed, 13 insertions(+), 10 deletions(-)
>
>
> diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
> index ef5c02d..3dcecdf 100644
> --- a/fs/cachefiles/interface.c
> +++ b/fs/cachefiles/interface.c
> @@ -104,6 +104,7 @@ nomem_key:
>        kfree(buffer);
>  nomem_buffer:
>        BUG_ON(test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
> +       fscache_object_destroy(&object->fscache);
>        kmem_cache_free(cachefiles_object_jar, object);
>        fscache_object_destroyed(&cache->cache);
>  nomem_object:
> diff --git a/fs/fscache/cache.c b/fs/fscache/cache.c
> index b52aed1..98bca68 100644
> --- a/fs/fscache/cache.c
> +++ b/fs/fscache/cache.c
> @@ -263,7 +263,6 @@ int fscache_add_cache(struct fscache_cache *cache,
>        spin_lock(&cache->object_list_lock);
>        list_add_tail(&ifsdef->cache_link, &cache->object_list);
>        spin_unlock(&cache->object_list_lock);
> -       fscache_objlist_add(ifsdef);
>
>        /* add the cache's netfs definition index object to the top level index
>         * cookie as a known backing object */
> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
> index 8dcb114..47d8cde 100644
> --- a/fs/fscache/cookie.c
> +++ b/fs/fscache/cookie.c
> @@ -360,7 +360,6 @@ static int fscache_attach_object(struct fscache_cookie *cookie,
>        atomic_inc(&cookie->usage);
>        hlist_add_head(&object->cookie_link, &cookie->backing_objects);
>
> -       fscache_objlist_add(object);
>        ret = 0;
>
>  cant_attach_object:
> diff --git a/fs/fscache/object-list.c b/fs/fscache/object-list.c
> index f27c89d..f8fbb32 100644
> --- a/fs/fscache/object-list.c
> +++ b/fs/fscache/object-list.c
> @@ -69,6 +69,7 @@ void fscache_objlist_add(struct fscache_object *obj)
>
>        write_unlock(&fscache_object_list_lock);
>  }
> +EXPORT_SYMBOL(fscache_objlist_add);
>
>  /**
>  * fscache_object_destroy - Note that a cache object is about to be destroyed
> diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
> index 633b65d..f657c0a 100644
> --- a/include/linux/fscache-cache.h
> +++ b/include/linux/fscache-cache.h
> @@ -440,6 +440,14 @@ extern const char *fscache_object_states[];
>
>  extern void fscache_object_work_func(struct work_struct *work);
>
> +#ifdef CONFIG_FSCACHE_OBJECT_LIST
> +extern void fscache_objlist_add(struct fscache_object *obj);
> +extern void fscache_object_destroy(struct fscache_object *object);
> +#else
> +#define fscache_object_destroy(object) do {} while(0)
> +#define fscache_objlist_add(object) do {} while(0)
> +#endif
> +
>  /**
>  * fscache_object_init - Initialise a cache object description
>  * @object: Object description
> @@ -454,8 +462,6 @@ void fscache_object_init(struct fscache_object *object,
>                         struct fscache_cookie *cookie,
>                         struct fscache_cache *cache)
>  {
> -       atomic_inc(&cache->object_count);
> -
>        object->state = FSCACHE_OBJECT_INIT;
>        spin_lock_init(&object->lock);
>        INIT_LIST_HEAD(&object->cache_link);
> @@ -473,17 +479,14 @@ void fscache_object_init(struct fscache_object *object,
>        object->cache = cache;
>        object->cookie = cookie;
>        object->parent = NULL;
> +
> +       atomic_inc(&cache->object_count);
> +       fscache_objlist_add(object);
>  }
>
>  extern void fscache_object_lookup_negative(struct fscache_object *object);
>  extern void fscache_obtained_object(struct fscache_object *object);
>
> -#ifdef CONFIG_FSCACHE_OBJECT_LIST
> -extern void fscache_object_destroy(struct fscache_object *object);
> -#else
> -#define fscache_object_destroy(object) do {} while(0)
> -#endif
> -
>  /**
>  * fscache_object_destroyed - Note destruction of an object in a cache
>  * @cache: The cache from which the object came
>
>

[-- Attachment #2: fscache.stats.gz --]
[-- Type: application/x-gzip, Size: 21977 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-14  9:22                             ` David Howells
  2011-10-14 23:25                               ` Mark Moseley
@ 2011-10-17 10:39                               ` David Howells
  2011-10-19 12:25                               ` David Howells
  2 siblings, 0 replies; 46+ messages in thread
From: David Howells @ 2011-10-17 10:39 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> Presumably it gets to bcull and stops storing but nothing's getting
> pruned. I can see cachefilesd is in constant activity right now.
> Looking at strace, maybe it can't find anything to prune because it
> thinks it's all active. I'm seeing a constant loop that looks similar
> to this:

Yeah...  The culling algorithm needs overhauling.  I have most of a patch to
do that, but I wanted to fix the bugs in the existing code before throwing a
new set over the top.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-14  9:22                             ` David Howells
  2011-10-14 23:25                               ` Mark Moseley
  2011-10-17 10:39                               ` David Howells
@ 2011-10-19 12:25                               ` David Howells
  2011-10-19 23:15                                 ` Mark Moseley
                                                   ` (2 more replies)
  2 siblings, 3 replies; 46+ messages in thread
From: David Howells @ 2011-10-19 12:25 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> Presumably it gets to bcull and stops storing but nothing's getting pruned.

It is possible that all the objects are pinned by inodes just sitting there in
the client's inode cache doing nothing.  Thus fscache thinks they're in use.

Any more oopses from fscache or cachefiles?

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-19 12:25                               ` David Howells
@ 2011-10-19 23:15                                 ` Mark Moseley
  2011-10-20  8:46                                 ` David Howells
  2011-10-20  9:03                                 ` David Howells
  2 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-19 23:15 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Wed, Oct 19, 2011 at 5:25 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> Presumably it gets to bcull and stops storing but nothing's getting pruned.
>
> It is possible that all the objects are pinned by inodes just sitting there in
> the client's inode cache doing nothing.  Thus fscache thinks they're in use.

I wasn't able to run 3.1.0-rc8 on that box over the weekend. I fired
it up this morning though. I left the existing cache in place. For a
few hours, it was culling (Used% on that partition got as low as 60%
at one point, i.e. 'brun'), but now it's back to stuck at 'bcull'.

Is there anything I can do to verify whether the objects are indeed
pinned? This is a pretty busy box. The nfs inode cache is quite large
(from slabtop):

900471 900377  99%    1.02K 300157        3   1200628K nfs_inode_cache

One slightly interesting thing, unrelated to fscache: This box is a
part of a pool of servers, serving the same web workloads. Another box
in this same pool is running 3.0.4, up for about 23 days (vs 6 hrs),
and the nfs_inode_cache is approximately 1/4 of the 3.1.0-rc8's,
size-wise, 1/3 #ofobjects-wise; likewise dentry in a 3.0.4 box with a
much longer uptime is about 1/9 the size (200k objs vs 1.8mil objects,
45megs vs 400megs) as the 3.1.0-rc8 box. Dunno if that's the result of
VM improvements or a symptom of something leaking :)

I don't see any other huge disparities, so I'm hoping it's the former.

Out of curiosity, did the dump of /proc/fs/fscache/stats show anything
interesting?


> Any more oopses from fscache or cachefiles?

The other day when I ran it, I didn't see any oopses after 16 or so
hours. I'll see if I can run it overnight and report back.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-19 12:25                               ` David Howells
  2011-10-19 23:15                                 ` Mark Moseley
@ 2011-10-20  8:46                                 ` David Howells
  2011-10-20 19:37                                   ` Mark Moseley
  2011-10-20  9:03                                 ` David Howells
  2 siblings, 1 reply; 46+ messages in thread
From: David Howells @ 2011-10-20  8:46 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> I left the existing cache in place. For a few hours, it was culling

That will be whilst the inode cache was filling, I suspect.

> Is there anything I can do to verify whether the objects are indeed
> pinned? This is a pretty busy box. The nfs inode cache is quite large
> (from slabtop):

Hmmm...  Can you get me a dump of /proc/fs/fscache/stats to look at?

The "Objects:" and "Relinqs:" lines are the most interesting.  On the first
line avl=N shows the number of objects that are in the "available" state -
ie. are live for caching.

Also, can you do:

	df -i /path/to/cache/partition

to get the number of inodes available and used.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-19 12:25                               ` David Howells
  2011-10-19 23:15                                 ` Mark Moseley
  2011-10-20  8:46                                 ` David Howells
@ 2011-10-20  9:03                                 ` David Howells
  2011-10-20 19:29                                   ` Mark Moseley
  2011-10-20 23:05                                   ` David Howells
  2 siblings, 2 replies; 46+ messages in thread
From: David Howells @ 2011-10-20  9:03 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> Out of curiosity, did the dump of /proc/fs/fscache/stats show anything
> interesting?

Ah...  I missed the attachment.

Looking at the number of pages currently marked (the difference between the
following two numbers):

	Pages  : mrk=3438716 unc=3223887
	...
	Pages  : mrk=7660986 unc=7608076
	Pages  : mrk=7668510 unc=7618591

That isn't very high.  214829 at the beginning, dropping to 49919 at the end.
I suspect this means that a lot of NFS inodes now exist that aren't now cached
(the cache is under no requirement to actually cache anything if it feels it
lacks the resources just to prevent the system from grinding to a halt).

Was the last item in the list just before a crash?  I presume not from your
comments.

> One slightly interesting thing, unrelated to fscache: This box is a
> part of a pool of servers, serving the same web workloads. Another box
> in this same pool is running 3.0.4, up for about 23 days (vs 6 hrs),
> and the nfs_inode_cache is approximately 1/4 of the 3.1.0-rc8's,
> size-wise, 1/3 #ofobjects-wise; likewise dentry in a 3.0.4 box with a
> much longer uptime is about 1/9 the size (200k objs vs 1.8mil objects,
> 45megs vs 400megs) as the 3.1.0-rc8 box. Dunno if that's the result of
> VM improvements or a symptom of something leaking :)

It also depends on what the load consists of.  For instance someone running a
lot of find commands would cause the server to skew in favour of inodes over
data, but someone reading/writing big files would skew it the other way.

Do I take it the 3.0.4 box is not running fscache, but the 3.1.0-rc8 box is?

David


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-20  9:03                                 ` David Howells
@ 2011-10-20 19:29                                   ` Mark Moseley
  2011-10-20 23:05                                   ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-20 19:29 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Oct 20, 2011 at 2:03 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> Out of curiosity, did the dump of /proc/fs/fscache/stats show anything
>> interesting?
>
> Ah...  I missed the attachment.
>
> Looking at the number of pages currently marked (the difference between the
> following two numbers):
>
>        Pages  : mrk=3438716 unc=3223887
>        ...
>        Pages  : mrk=7660986 unc=7608076
>        Pages  : mrk=7668510 unc=7618591
>
> That isn't very high.  214829 at the beginning, dropping to 49919 at the end.
> I suspect this means that a lot of NFS inodes now exist that aren't now cached
> (the cache is under no requirement to actually cache anything if it feels it
> lacks the resources just to prevent the system from grinding to a halt).
>
> Was the last item in the list just before a crash?  I presume not from your
> comments.

Nope, it wasn't. I had to reboot it back into the previous kernel for
the weekend. I got a couple of oops though in the past 12 hours. I
unfortunately hadn't fired up that dump again though (I have now and
will send a new dump when I get another crash)

Here's the first:

[67666.379861] ------------[ cut here ]------------
[67666.379991] FS-Cache: Asserting on ReadOrAllocPages operation
[67666.379994]
[67666.379995] FS-Cache: Assertion failed
[67666.379997] 3 == 5 is false
[67666.389761] kernel BUG at fs/fscache/operation.c:426!
[67666.418480] invalid opcode: 0000 [#1] SMP
[67666.418480] CPU 6
[67666.418480] Modules linked in: xfs ioatdma dca loop joydev evdev
i5000_edac psmouse dcdbas edac_core pcspkr serio_raw shpchp
pci_hotplug i5k_amb sg sr_mod cdrom ehci_hcd ]
[67666.463678]
[67666.463678] Pid: 29497, comm: kworker/u:1 Not tainted 3.1.0-rc8 #1
Dell Inc. PowerEdge 1950/0DT097
[67666.463678] RIP: 0010:[<ffffffff81198832>]  [<ffffffff81198832>]
fscache_put_operation+0x332/0x360
[67666.463678] RSP: 0018:ffff880169e89dd0  EFLAGS: 00010286
[67666.463678] RAX: 0000000000000025 RBX: ffff8801ab673ae4 RCX: ffffffff81dfb040
[67666.463678] RDX: 00000000ffffffff RSI: 0000000000000086 RDI: ffffffff81dfaf30
[67666.463678] RBP: ffff880169e89df0 R08: 0000000000000006 R09: 0000000000000000
[67666.463678] R10: 0000000000000000 R11: 0000000000000003 R12: ffff8801ab673aa0
[67666.463678] R13: ffff8801dfe9d340 R14: ffffffff81e1d780 R15: ffff8802271f2305
[67666.463678] FS:  0000000000000000(0000) GS:ffff88022fd80000(0000)
knlGS:0000000000000000
[67666.463678] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[67666.463678] CR2: ffffffffff600400 CR3: 00000001ce38f000 CR4: 00000000000006e0
[67666.463678] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[67666.463678] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[67666.463678] Process kworker/u:1 (pid: 29497, threadinfo
ffff880169e88000, task ffff880048418040)
[67666.463678] Stack:
[67666.463678]  ffff8801ab673aa0 ffff8801ab673aa0 ffff8801ab673aa0
ffff8801dfe9d340
[67666.463678]  ffff880169e89e10 ffffffff8119889b ffffffff81e1d780
0000000000000000
[67666.463678]  ffff880169e89e60 ffffffff810661d4 0000000000000000
ffff8802271f2300
[67666.463678] Call Trace:
[67666.463678]  [<ffffffff8119889b>] fscache_op_work_func+0x3b/0xd0
[67666.463678]  [<ffffffff810661d4>] process_one_work+0x164/0x440
[67666.463678]  [<ffffffff81198860>] ? fscache_put_operation+0x360/0x360
[67666.463678]  [<ffffffff810669ec>] worker_thread+0x32c/0x430
[67666.463678]  [<ffffffff810666c0>] ? manage_workers+0x210/0x210
[67666.463678]  [<ffffffff8106d27e>] kthread+0x9e/0xb0
[67666.463678]  [<ffffffff8163f7b4>] kernel_thread_helper+0x4/0x10
[67666.463678]  [<ffffffff8163ce4a>] ? retint_restore_args+0xe/0xe
[67666.463678]  [<ffffffff8106d1e0>] ? kthread_worker_fn+0x190/0x190
[67666.463678]  [<ffffffff8163f7b0>] ? gs_change+0xb/0xb
[67666.463678] Code: 0c 10 4a 00 48 c7 c7 00 fb a1 81 31 c0 e8 fe 0f
4a 00 41 0f b6 74 24 40 ba 05 00 00 00 48 c7 c7 5b 75 a1 81 31 c0 e8
e5 0f 4a 00 <0f> 0b eb fe 65 48 8b
[67666.463678] RIP  [<ffffffff81198832>] fscache_put_operation+0x332/0x360
[67666.463678]  RSP <ffff880169e89dd0>
[67666.463279] ------------[ cut here ]------------
[67666.924857] ---[ end trace 0c287f8e8a3ed6ea ]---

At one point during that same run, I got this warning in the logs:

[30741.772165] httpd: page allocation failure: order:0, mode:0x11110
[30741.772175] Pid: 6343, comm: httpd Not tainted 3.1.0-rc8 #1
[30741.772180] Call Trace:
[30741.772199]  [<ffffffff810e524b>] warn_alloc_failed+0xfb/0x160
[30741.772205]  [<ffffffff810e1e8f>] ? zone_watermark_ok+0x1f/0x30
[30741.772210]  [<ffffffff810e4fa4>] ? get_page_from_freelist+0x444/0x5f0
[30741.772215]  [<ffffffff810e3720>] ? __zone_pcp_update+0xd0/0xd0
[30741.772220]  [<ffffffff810e58eb>] __alloc_pages_nodemask+0x54b/0x770
[30741.772229]  [<ffffffff8128c5ab>] cachefiles_read_or_alloc_pages+0xe8b/0xef0
[30741.772239]  [<ffffffff8106d6a7>] ? bit_waitqueue+0x17/0xb0
[30741.772243]  [<ffffffff8106d7ad>] ? wake_up_bit+0x2d/0x40
[30741.772252]  [<ffffffff81198f7a>] ? fscache_run_op+0x5a/0xa0
[30741.772256]  [<ffffffff81199653>] ? fscache_submit_op+0x373/0x5b0
[30741.772261]  [<ffffffff8119ba33>] __fscache_read_or_alloc_pages+0x3f3/0x530
[30741.772268]  [<ffffffff81251500>] __nfs_readpages_from_fscache+0x70/0x1c0
[30741.772275]  [<ffffffff812321ea>] nfs_readpages+0xca/0x1e0
[30741.772281]  [<ffffffff812d65de>] ? radix_tree_lookup+0x5e/0x80
[30741.772287]  [<ffffffff810e7e4a>] __do_page_cache_readahead+0x1ca/0x270
[30741.772291]  [<ffffffff810e7f11>] ra_submit+0x21/0x30
[30741.772295]  [<ffffffff810e813d>] ondemand_readahead+0x11d/0x250
[30741.772300]  [<ffffffff810e8319>] page_cache_async_readahead+0xa9/0xc0
[30741.772306]  [<ffffffff8114dd32>] __generic_file_splice_read+0x402/0x500
[30741.772315]  [<ffffffff8163c904>] ? _raw_spin_unlock_bh+0x14/0x20
[30741.772324]  [<ffffffff815925af>] ? tcp_sendpage+0xcf/0x6e0
[30741.772331]  [<ffffffff815b5f0f>] ? inet_sendpage+0x7f/0x110
[30741.772336]  [<ffffffff8114b7f0>] ? page_cache_pipe_buf_release+0x30/0x30
[30741.772341]  [<ffffffff8114de7f>] generic_file_splice_read+0x4f/0x90
[30741.772349]  [<ffffffff81224ccd>] nfs_file_splice_read+0x8d/0xe0
[30741.772353]  [<ffffffff8114c157>] do_splice_to+0x77/0xb0
[30741.772357]  [<ffffffff8114c9ac>] splice_direct_to_actor+0xcc/0x1e0
[30741.772362]  [<ffffffff8114c0b0>] ? do_splice_from+0xb0/0xb0
[30741.772366]  [<ffffffff8114cb17>] do_splice_direct+0x57/0x80
[30741.772372]  [<ffffffff81122026>] do_sendfile+0x166/0x1d0
[30741.772379]  [<ffffffff811347a8>] ? poll_select_set_timeout+0x88/0xa0
[30741.772383]  [<ffffffff8112211d>] sys_sendfile64+0x8d/0xb0
[30741.772389]  [<ffffffff8163d66b>] system_call_fastpath+0x16/0x1b

Not sure why it'd give that, but sysstat shows that it definitely
wasn't out of memory (and it's got 16gb of swap, so it'd take a while
-- i.e. at least once caught by sysstat -- before it actually did run
out). But that was also a good while before the crash. Dunno if
related, but I figured I should mention it. I've not seen it any other
time though.

The second crash was just a few minutes ago, but the traceback looks
like the others:

[24724.296274] kernel BUG at fs/fscache/operation.c:426!
[24724.300702] invalid opcode: 0000 [#1] SMP
[24724.311107] CPU 1
[24724.311107] Modules linked in: xfs loop joydev evdev psmouse dcdbas
pcspkr serio_raw shpchp i5000_edac edac_core pci_hotplug i5k_amb sg
sr_mod cdrom ]
[24724.311107]
[24724.311107] Pid: 14830, comm: kworker/u:1 Not tainted 3.1.0-rc8 #1
Dell Inc. PowerEdge 1950/0DT097
[24724.311107] RIP: 0010:[<ffffffff81198832>]  [<ffffffff81198832>]
fscache_put_operation+0x332/0x360
[24724.390018] RSP: 0000:ffff880082063dd0  EFLAGS: 00010286
[24724.390018] RAX: 0000000000000025 RBX: ffff8802245b80c4 RCX: ffffffff81dfb040
[24724.410019] RDX: 00000000ffffffff RSI: 0000000000000086 RDI: ffffffff81dfaf30
[24724.410019] RBP: ffff880082063df0 R08: 0000000000000006 R09: 0000000000000000
[24724.410019] R10: 0000000000000000 R11: 0000000000000003 R12: ffff8802245b8080
[24724.410019] R13: ffff8801b89fd210 R14: ffffffff81e1d780 R15: ffff880227255305
[24724.410019] FS:  0000000000000000(0000) GS:ffff88022fc40000(0000)
knlGS:0000000000000000
[24724.410019] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[24724.410019] CR2: 00000000047bf0f8 CR3: 00000001b854b000 CR4: 00000000000006e0
[24724.410019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[24724.410019] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[24724.410019] Process kworker/u:1 (pid: 14830, threadinfo
ffff880082062000, task ffff8801cab3c380)
[24724.410019] Stack:
[24724.410019]  ffff8802245b8080 ffff8802245b8080 ffff8802245b8080
ffff8801b89fd210
[24724.410019]  ffff880082063e10 ffffffff8119889b ffffffff81e1d780
0000000000000000
[24724.410019]  ffff880082063e60 ffffffff810661d4 0000000000000000
ffff880227255300
[24724.410019] Call Trace:
[24724.410019]  [<ffffffff8119889b>] fscache_op_work_func+0x3b/0xd0
[24724.410019]  [<ffffffff810661d4>] process_one_work+0x164/0x440
[24724.410019]  [<ffffffff81198860>] ? fscache_put_operation+0x360/0x360
[24724.410019]  [<ffffffff8106685b>] worker_thread+0x19b/0x430
[24724.410019]  [<ffffffff810666c0>] ? manage_workers+0x210/0x210
[24724.410019]  [<ffffffff8106d27e>] kthread+0x9e/0xb0
[24724.410019]  [<ffffffff8163f7b4>] kernel_thread_helper+0x4/0x10
[24724.410019]  [<ffffffff8163ce4a>] ? retint_restore_args+0xe/0xe
[24724.410019]  [<ffffffff8106d1e0>] ? kthread_worker_fn+0x190/0x190
[24724.410019]  [<ffffffff8163f7b0>] ? gs_change+0xb/0xb
[24724.410019] Code: 0c 10 4a 00 48 c7 c7 00 fb a1 81 31 c0 e8 fe 0f
4a 00 41 0f b6 74 24 40 ba 05 00 00 00 48 c7 c7 5b 75 a1 81 31 c0 e8
e5 0f 4a 00 <0
[24724.410019] RIP  [<ffffffff81198832>] fscache_put_operation+0x332/0x360
[24724.410019]  RSP <ffff880082063dd0>
[24724.798329] ---[ end trace 54b4ea9a2f86ea9e ]---




>> One slightly interesting thing, unrelated to fscache: This box is a
>> part of a pool of servers, serving the same web workloads. Another box
>> in this same pool is running 3.0.4, up for about 23 days (vs 6 hrs),
>> and the nfs_inode_cache is approximately 1/4 of the 3.1.0-rc8's,
>> size-wise, 1/3 #ofobjects-wise; likewise dentry in a 3.0.4 box with a
>> much longer uptime is about 1/9 the size (200k objs vs 1.8mil objects,
>> 45megs vs 400megs) as the 3.1.0-rc8 box. Dunno if that's the result of
>> VM improvements or a symptom of something leaking :)
>
> It also depends on what the load consists of.  For instance someone running a
> lot of find commands would cause the server to skew in favour of inodes over
> data, but someone reading/writing big files would skew it the other way.

This is static web serving, 100% over NFS, so it'll be traversing a
lot of inodes and directories, as well as possibly reading very large
files, so I guess that'd be somewhere in between.


> Do I take it the 3.0.4 box is not running fscache, but the 3.1.0-rc8 box is?

The 3.0.4 box I mentioned is not running fscache, no. I mentioned it
mainly cause it was sort of interesting, but also in the interest of
bringing up anything that might even be tangentially related to this.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-20  8:46                                 ` David Howells
@ 2011-10-20 19:37                                   ` Mark Moseley
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-20 19:37 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Oct 20, 2011 at 1:46 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> I left the existing cache in place. For a few hours, it was culling
>
> That will be whilst the inode cache was filling, I suspect.
>
>> Is there anything I can do to verify whether the objects are indeed
>> pinned? This is a pretty busy box. The nfs inode cache is quite large
>> (from slabtop):
>
> Hmmm...  Can you get me a dump of /proc/fs/fscache/stats to look at?
>
> The "Objects:" and "Relinqs:" lines are the most interesting.  On the first
> line avl=N shows the number of objects that are in the "available" state -
> ie. are live for caching.
>
> Also, can you do:
>
>        df -i /path/to/cache/partition
>
> to get the number of inodes available and used.


# df /var/cache/fscache/
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb6             29338652  16148968  11699352  58% /var/cache/fscache

# df -i /var/cache/fscache/
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sdb6            1864128  339163 1524965   19% /var/cache/fscache

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-20  9:03                                 ` David Howells
  2011-10-20 19:29                                   ` Mark Moseley
@ 2011-10-20 23:05                                   ` David Howells
  2011-10-21  0:21                                     ` Mark Moseley
  2011-10-21  8:16                                     ` David Howells
  1 sibling, 2 replies; 46+ messages in thread
From: David Howells @ 2011-10-20 23:05 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> Nope, it wasn't. I had to reboot it back into the previous kernel for
> the weekend. I got a couple of oops though in the past 12 hours. I
> unfortunately hadn't fired up that dump again though (I have now and
> will send a new dump when I get another crash)

Okay.  I'm away next week at a pair of conferences, but I'll try and have a
look.

> [67666.389761] kernel BUG at fs/fscache/operation.c:426!

Can you have a look what's on line 426 for you, please?  It's a blank line in
my kernel.  I think I've managed to rearrange my patches to muck the ordering
up.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-20 23:05                                   ` David Howells
@ 2011-10-21  0:21                                     ` Mark Moseley
  2011-10-21  8:16                                     ` David Howells
  1 sibling, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-10-21  0:21 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

On Thu, Oct 20, 2011 at 4:05 PM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> Nope, it wasn't. I had to reboot it back into the previous kernel for
>> the weekend. I got a couple of oops though in the past 12 hours. I
>> unfortunately hadn't fired up that dump again though (I have now and
>> will send a new dump when I get another crash)
>
> Okay.  I'm away next week at a pair of conferences, but I'll try and have a
> look.
>
>> [67666.389761] kernel BUG at fs/fscache/operation.c:426!
>
> Can you have a look what's on line 426 for you, please?  It's a blank line in
> my kernel.  I think I've managed to rearrange my patches to muck the ordering
> up.

Here's line 426 with some context:


418         _debug("PUT OP");
419
420         if (op->state != FSCACHE_OP_ST_COMPLETE &&
421             op->state != FSCACHE_OP_ST_CANCELLED)
422                 printk("FS-Cache: Asserting on %s operation\n",
423                        fscache_op_names[op->name]);
424
425         ASSERTIFCMP(op->state != FSCACHE_OP_ST_COMPLETE,
426                     op->state, ==, FSCACHE_OP_ST_CANCELLED);
427         op->state = FSCACHE_OP_ST_DEAD;
428
429         fscache_stat(&fscache_n_op_release);
430
431         if (op->release) {
432                 op->release(op);
433                 op->release = NULL;
434         }
435
436         object = op->object;
437

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-20 23:05                                   ` David Howells
  2011-10-21  0:21                                     ` Mark Moseley
@ 2011-10-21  8:16                                     ` David Howells
  2011-10-21 18:09                                       ` Mark Moseley
  1 sibling, 1 reply; 46+ messages in thread
From: David Howells @ 2011-10-21  8:16 UTC (permalink / raw)
  To: Mark Moseley
  Cc: dhowells, Linux filesystem caching discussion list, linux-kernel

Mark Moseley <moseleymark@gmail.com> wrote:

> >> [67666.389761] kernel BUG at fs/fscache/operation.c:426!
> >
> > Can you have a look what's on line 426 for you, please?  It's a blank line
> > in my kernel.  I think I've managed to rearrange my patches to muck the
> > ordering up.
> 
> Here's line 426 with some context:

Ah...  It's the same one you've seen before - and now the debugging patch I
gave you has yielded a bit more context.

David

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-21  8:16                                     ` David Howells
@ 2011-10-21 18:09                                       ` Mark Moseley
  2011-12-13  1:56                                         ` Mark Moseley
  0 siblings, 1 reply; 46+ messages in thread
From: Mark Moseley @ 2011-10-21 18:09 UTC (permalink / raw)
  To: David Howells; +Cc: Linux filesystem caching discussion list, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 796 bytes --]

On Fri, Oct 21, 2011 at 1:16 AM, David Howells <dhowells@redhat.com> wrote:
> Mark Moseley <moseleymark@gmail.com> wrote:
>
>> >> [67666.389761] kernel BUG at fs/fscache/operation.c:426!
>> >
>> > Can you have a look what's on line 426 for you, please?  It's a blank line
>> > in my kernel.  I think I've managed to rearrange my patches to muck the
>> > ordering up.
>>
>> Here's line 426 with some context:
>
> Ah...  It's the same one you've seen before - and now the debugging patch I
> gave you has yielded a bit more context.
>
> David
>

Ok, got another of the "[33353.585702] kernel BUG at
fs/fscache/operation.c:426!" variety. I've attached a dump of fscache
stats from basically the entire uptime of the box, minus the first
couple minutes, right till it went down.

[-- Attachment #2: fscache.stats.20111021.gz --]
[-- Type: application/x-gzip, Size: 85232 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Linux-cachefs] 3.0.3 64-bit Crash running fscache/cachefilesd
  2011-10-21 18:09                                       ` Mark Moseley
@ 2011-12-13  1:56                                         ` Mark Moseley
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Moseley @ 2011-12-13  1:56 UTC (permalink / raw)
  Cc: Linux filesystem caching discussion list, linux-kernel

On Fri, Oct 21, 2011 at 11:09 AM, Mark Moseley <moseleymark@gmail.com> wrote:
> On Fri, Oct 21, 2011 at 1:16 AM, David Howells <dhowells@redhat.com> wrote:
>> Mark Moseley <moseleymark@gmail.com> wrote:
>>
>>> >> [67666.389761] kernel BUG at fs/fscache/operation.c:426!
>>> >
>>> > Can you have a look what's on line 426 for you, please?  It's a blank line
>>> > in my kernel.  I think I've managed to rearrange my patches to muck the
>>> > ordering up.
>>>
>>> Here's line 426 with some context:
>>
>> Ah...  It's the same one you've seen before - and now the debugging patch I
>> gave you has yielded a bit more context.
>>
>> David
>>
>
> Ok, got another of the "[33353.585702] kernel BUG at
> fs/fscache/operation.c:426!" variety. I've attached a dump of fscache
> stats from basically the entire uptime of the box, minus the first
> couple minutes, right till it went down.

Anything I should be trying/testing on 3.1.5 or 3.2-rc?

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2011-12-13  1:56 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-25 16:44 3.0.3 64-bit Crash running fscache/cachefilesd Mark Moseley
2011-08-26 12:52 ` [Linux-cachefs] " Дмитрий Ильин
2011-09-01 13:04 ` David Howells
2011-09-22 17:03   ` Mark Moseley
2011-09-22 21:41     ` Mark Moseley
2011-09-26 11:32     ` David Howells
2011-09-26 21:02       ` Mark Moseley
2011-09-27  0:59         ` Mark Moseley
2011-09-27 23:46           ` Mark Moseley
2011-09-29 14:57         ` David Howells
2011-09-29 15:51           ` Mark Moseley
2011-09-29 16:30           ` David Howells
2011-09-29 19:02             ` Mark Moseley
2011-09-29 22:11               ` Mark Moseley
2011-09-29 22:44                 ` Mark Moseley
2011-09-29 22:44               ` David Howells
2011-09-29 22:51                 ` Mark Moseley
2011-09-30 12:28                 ` David Howells
2011-09-30 18:57                   ` Mark Moseley
2011-09-30 20:10                   ` David Howells
2011-10-05 13:37                   ` David Howells
2011-10-05 13:49                   ` David Howells
2011-10-07 10:42                   ` David Howells
2011-10-08 16:32                     ` Mark Moseley
2011-10-11 13:07                     ` David Howells
2011-10-11 16:27                       ` Mark Moseley
2011-10-12  9:26                       ` David Howells
2011-10-12 10:05                       ` David Howells
2011-10-12 18:10                         ` Mark Moseley
2011-10-12 23:38                           ` Mark Moseley
2011-10-13 15:21                           ` David Howells
2011-10-13 20:48                             ` Mark Moseley
2011-10-14  9:22                             ` David Howells
2011-10-14 23:25                               ` Mark Moseley
2011-10-17 10:39                               ` David Howells
2011-10-19 12:25                               ` David Howells
2011-10-19 23:15                                 ` Mark Moseley
2011-10-20  8:46                                 ` David Howells
2011-10-20 19:37                                   ` Mark Moseley
2011-10-20  9:03                                 ` David Howells
2011-10-20 19:29                                   ` Mark Moseley
2011-10-20 23:05                                   ` David Howells
2011-10-21  0:21                                     ` Mark Moseley
2011-10-21  8:16                                     ` David Howells
2011-10-21 18:09                                       ` Mark Moseley
2011-12-13  1:56                                         ` Mark Moseley

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.