[4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment

All of lore.kernel.org
 help / color / mirror / Atom feed

* [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-14 22:24 Dave Chinner
  2016-12-14 22:29 ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-14 22:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lee Duncan, Chris Leech, open-iscsi

Hi folks,

Just updated my test boxes from 4.9 to a current Linus 4.10 merge
window kernel to test the XFS merge I am preparing for Linus.
Unfortunately, all my test VMs using iscsi failed pretty much
instantly on the first mount of an iscsi device:

[  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
[  159.374612] XFS (sdb): Mounting V5 Filesystem
[  159.425710] XFS (sdb): Ending clean mount
[  160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
[  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
[  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
PMD 0 [  160.277309]
[  160.277523] Oops: 0000 [#1] PREEMPT SMP
[  160.278004] Modules linked in:
[  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
[  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
[  160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
[  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
[  160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
[  160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
[  160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
[  160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
[  160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
[  160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
[  160.287362] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  160.288340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
[  160.290084] Call Trace:
[  160.290429]  ? inet_sendpage+0x4d/0x140
[  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
[  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
[  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
[  160.292771]  iscsi_xmit_task+0x53/0xc0
[  160.293282]  iscsi_xmitworker+0x274/0x310
[  160.293835]  process_one_work+0x1de/0x4d0
[  160.294388]  worker_thread+0x4b/0x4f0
[  160.294889]  kthread+0x10c/0x140
[  160.295333]  ? process_one_work+0x4d0/0x4d0
[  160.295898]  ? kthread_create_on_node+0x40/0x40
[  160.296525]  ret_from_fork+0x25/0x30
[  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
[  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
[  160.301584] CR2: 000000000000000c


Known problem, or something new?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-14 22:24 [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0 Dave Chinner
@ 2016-12-14 22:29 ` Dave Chinner
  2016-12-16 18:59   ` Chris Leech
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-14 22:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: Lee Duncan, Chris Leech, open-iscsi

On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> Hi folks,
> 
> Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> window kernel to test the XFS merge I am preparing for Linus.
> Unfortunately, all my test VMs using iscsi failed pretty much
> instantly on the first mount of an iscsi device:
> 
> [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> [  159.374612] XFS (sdb): Mounting V5 Filesystem
> [  159.425710] XFS (sdb): Ending clean mount
> [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
> [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0

FYI, crash is here:

(gdb) l *(iscsi_tcp_segment_done+0x20d)
0xffffffff81b950bd is in iscsi_tcp_segment_done (drivers/scsi/libiscsi_tcp.c:102).
97	iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
98				  struct scatterlist *sg, unsigned int offset)
99	{
100		segment->sg = sg;
101		segment->sg_offset = offset;
102		segment->size = min(sg->length - offset,
103				    segment->total_size - segment->total_copied);
104		segment->data = NULL;
105	}
106	

So it looks to be sg = NULL, which means there's probably an issue
with the scatterlist...

-Dave.

> [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> PMD 0 [  160.277309]
> [  160.277523] Oops: 0000 [#1] PREEMPT SMP
> [  160.278004] Modules linked in:
> [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> [  160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
> [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> [  160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
> [  160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
> [  160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
> [  160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
> [  160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
> [  160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
> [  160.287362] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> [  160.288340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
> [  160.290084] Call Trace:
> [  160.290429]  ? inet_sendpage+0x4d/0x140
> [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> [  160.292771]  iscsi_xmit_task+0x53/0xc0
> [  160.293282]  iscsi_xmitworker+0x274/0x310
> [  160.293835]  process_one_work+0x1de/0x4d0
> [  160.294388]  worker_thread+0x4b/0x4f0
> [  160.294889]  kthread+0x10c/0x140
> [  160.295333]  ? process_one_work+0x4d0/0x4d0
> [  160.295898]  ? kthread_create_on_node+0x40/0x40
> [  160.296525]  ret_from_fork+0x25/0x30
> [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
> [  160.301584] CR2: 000000000000000c
> 
> 
> Known problem, or something new?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-14 22:29 ` Dave Chinner
@ 2016-12-16 18:59   ` Chris Leech
  2016-12-21 22:16     ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Chris Leech @ 2016-12-16 18:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, Lee Duncan, open-iscsi

Thanks Dave,

I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
modules loaded (virtio block) so there's something else going on in the
current merge window.  I'll keep an eye on it and make sure there's
nothing iSCSI needs fixing for.

Chris

On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > window kernel to test the XFS merge I am preparing for Linus.
> > Unfortunately, all my test VMs using iscsi failed pretty much
> > instantly on the first mount of an iscsi device:
> > 
> > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > [  159.425710] XFS (sdb): Ending clean mount
> > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
> > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> 
> FYI, crash is here:
> 
> (gdb) l *(iscsi_tcp_segment_done+0x20d)
> 0xffffffff81b950bd is in iscsi_tcp_segment_done (drivers/scsi/libiscsi_tcp.c:102).
> 97	iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> 98				  struct scatterlist *sg, unsigned int offset)
> 99	{
> 100		segment->sg = sg;
> 101		segment->sg_offset = offset;
> 102		segment->size = min(sg->length - offset,
> 103				    segment->total_size - segment->total_copied);
> 104		segment->data = NULL;
> 105	}
> 106	
> 
> So it looks to be sg = NULL, which means there's probably an issue
> with the scatterlist...
> 
> -Dave.
> 
> > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > PMD 0 [  160.277309]
> > [  160.277523] Oops: 0000 [#1] PREEMPT SMP
> > [  160.278004] Modules linked in:
> > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > [  160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
> > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > [  160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
> > [  160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
> > [  160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
> > [  160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
> > [  160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
> > [  160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
> > [  160.287362] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> > [  160.288340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
> > [  160.290084] Call Trace:
> > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > [  160.293835]  process_one_work+0x1de/0x4d0
> > [  160.294388]  worker_thread+0x4b/0x4f0
> > [  160.294889]  kthread+0x10c/0x140
> > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > [  160.296525]  ret_from_fork+0x25/0x30
> > [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> > [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
> > [  160.301584] CR2: 000000000000000c
> > 
> > 
> > Known problem, or something new?
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-16 18:59   ` Chris Leech
@ 2016-12-21 22:16     ` Dave Chinner
  2016-12-21 23:19       ` Linus Torvalds
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-21 22:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chris Leech, linux-kernel, Lee Duncan, open-iscsi, linux-scsi,
	linux-block

On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
> 
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window.  I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.

OK, so before this slips through the cracks.....

Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.

I'm disappearing for several months at the end of tomorrow, so I
thought I better make sure you know about it.  I've also added
linux-scsi, linux-block to the cc list....

Cheers,

Dave.

> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > > 
> > > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> > > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [  159.425710] XFS (sdb): Ending clean mount
> > > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
> > > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > FYI, crash is here:
> > 
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0xffffffff81b950bd is in iscsi_tcp_segment_done (drivers/scsi/libiscsi_tcp.c:102).
> > 97	iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98				  struct scatterlist *sg, unsigned int offset)
> > 99	{
> > 100		segment->sg = sg;
> > 101		segment->sg_offset = offset;
> > 102		segment->size = min(sg->length - offset,
> > 103				    segment->total_size - segment->total_copied);
> > 104		segment->data = NULL;
> > 105	}
> > 106	
> > 
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> > 
> > -Dave.
> > 
> > > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > > PMD 0 [  160.277309]
> > > [  160.277523] Oops: 0000 [#1] PREEMPT SMP
> > > [  160.278004] Modules linked in:
> > > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [  160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
> > > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [  160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
> > > [  160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
> > > [  160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
> > > [  160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
> > > [  160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
> > > [  160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
> > > [  160.287362] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> > > [  160.288340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
> > > [  160.290084] Call Trace:
> > > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > > [  160.293835]  process_one_work+0x1de/0x4d0
> > > [  160.294388]  worker_thread+0x4b/0x4f0
> > > [  160.294889]  kthread+0x10c/0x140
> > > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > > [  160.296525]  ret_from_fork+0x25/0x30
> > > [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> > > [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
> > > [  160.301584] CR2: 000000000000000c
> > > 
> > > 
> > > Known problem, or something new?
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > 
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-21 22:16     ` Dave Chinner
@ 2016-12-21 23:19       ` Linus Torvalds
  2016-12-22  0:13         ` Chris Leech
  2016-12-22  6:18         ` Christoph Hellwig
  0 siblings, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2016-12-21 23:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chris Leech, Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

Hi,

On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
>> Thanks Dave,
>>
>> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
>> modules loaded (virtio block) so there's something else going on in the
>> current merge window.  I'll keep an eye on it and make sure there's
>> nothing iSCSI needs fixing for.
>
> OK, so before this slips through the cracks.....
>
> Linus - your tree as of a few minutes ago still panics immediately
> when starting xfstests on iscsi devices. It appears to be a
> scatterlist corruption and not an iscsi problem, so the iscsi guys
> seem to have bounced it and no-one is looking at it.

Hmm. There's not much to go by.

Can somebody in iscsi-land please try to just bisect it - I'm not
seeing a lot of clues to where this comes from otherwise.

There clearly hasn't been anything done to drivers/scsi/*iscsi* since
4.9, so it's coming from elsewhere, presumably the block layer changes
as you say.

But I think we need iscsi people to pinpoint it a bit more, since
nobody else seems to be complaining.

Looking around a bit, the only even halfway suspicious scatterlist
initialization thing I see is commit f9d03f96b988 ("block: improve
handling of the magic discard payload") which used to have a magic
hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
now it does __blk_bvec_map_sg() instead.

Maybe __blk_bvec_map_sg() would want that same

    if (!bio->bi_vcnt)
        return 0;

adding Christoph explicitly so that he can tell me why I'm a clueless
weenie. Although I suspect he's already on the scsi and block lists
and would have done so anyway.

> I'm disappearing for several months at the end of tomorrow, so I
> thought I better make sure you know about it.  I've also added
> linux-scsi, linux-block to the cc list....

Thanks,

            Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-21 23:19       ` Linus Torvalds
@ 2016-12-22  0:13         ` Chris Leech
  2016-12-22  5:13           ` Dave Chinner
  2016-12-22  6:18         ` Christoph Hellwig
  1 sibling, 1 reply; 57+ messages in thread
From: Chris Leech @ 2016-12-22  0:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Hi,
> 
> On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> >> Thanks Dave,
> >>
> >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> >> modules loaded (virtio block) so there's something else going on in the
> >> current merge window.  I'll keep an eye on it and make sure there's
> >> nothing iSCSI needs fixing for.
> >
> > OK, so before this slips through the cracks.....
> >
> > Linus - your tree as of a few minutes ago still panics immediately
> > when starting xfstests on iscsi devices. It appears to be a
> > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > seem to have bounced it and no-one is looking at it.
> 
> Hmm. There's not much to go by.
> 
> Can somebody in iscsi-land please try to just bisect it - I'm not
> seeing a lot of clues to where this comes from otherwise.

Yeah, my hopes of this being quickly resolved by someone else didn't
work out and whatever is going on in that test VM is looking like a
different kind of odd.  I'm saving that off for later, and seeing if I
can't be a bisect on the iSCSI issue.

Chris

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  0:13         ` Chris Leech
@ 2016-12-22  5:13           ` Dave Chinner
  2016-12-22  5:46             ` Linus Torvalds
  2016-12-22  6:28             ` Dave Chinner
  0 siblings, 2 replies; 57+ messages in thread
From: Dave Chinner @ 2016-12-22  5:13 UTC (permalink / raw)
  To: Chris Leech, Linus Torvalds, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block,
	Christoph Hellwig

On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> > 
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window.  I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.....
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> > 
> > Hmm. There's not much to go by.
> > 
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
> 
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd.  I'm saving that off for later, and seeing if I
> can't be a bisect on the iSCSI issue.

There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:

[628867.607417] ------------[ cut here ]------------
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G        W       4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691]  dump_stack+0x63/0x83
[628867.637937]  __warn+0xcb/0xf0
[628867.641359]  warn_slowpath_null+0x1d/0x20
[628867.643362]  shadow_lru_isolate+0x171/0x220
[628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780]  ? __list_lru_init+0x70/0x70
[628867.646628]  list_lru_walk_one+0x17/0x20
[628867.647488]  scan_shadow_nodes+0x34/0x50
[628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506]  shrink_node+0x57/0x90
[628867.650233]  do_try_to_free_pages+0xdd/0x230
[628867.651157]  try_to_free_pages+0xce/0x1a0
[628867.652342]  __alloc_pages_slowpath+0x2df/0x960
[628867.653332]  ? __might_sleep+0x4a/0x80
[628867.654148]  __alloc_pages_nodemask+0x24b/0x290
[628867.655237]  kmalloc_order+0x21/0x50
[628867.656016]  kmalloc_order_trace+0x24/0xc0
[628867.656878]  __kmalloc+0x17d/0x1d0
[628867.657644]  bucket_table_alloc+0x195/0x1d0
[628867.658564]  ? __might_sleep+0x4a/0x80
[628867.659449]  rht_deferred_worker+0x287/0x3c0
[628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294]  process_one_work+0x1de/0x4d0
[628867.662208]  worker_thread+0x4b/0x4f0
[628867.662990]  kthread+0x10c/0x140
[628867.663687]  ? process_one_work+0x4d0/0x4d0
[628867.664564]  ? kthread_create_on_node+0x40/0x40
[628867.665523]  ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---

Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.

And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
35074 root      -2   0       0      0      0 R  99.0  0.0  12:38.92 acpi_pad/12
35079 root      -2   0       0      0      0 R  99.0  0.0  12:39.40 acpi_pad/16
35080 root      -2   0       0      0      0 R  99.0  0.0  12:39.29 acpi_pad/17
35085 root      -2   0       0      0      0 R  99.0  0.0  12:39.35 acpi_pad/22
35087 root      -2   0       0      0      0 R  99.0  0.0  12:39.13 acpi_pad/24
35090 root      -2   0       0      0      0 R  99.0  0.0  12:38.89 acpi_pad/27
35093 root      -2   0       0      0      0 R  99.0  0.0  12:38.88 acpi_pad/30
35063 root      -2   0       0      0      0 R  98.1  0.0  12:40.64 acpi_pad/1
35065 root      -2   0       0      0      0 R  98.1  0.0  12:40.38 acpi_pad/3
35066 root      -2   0       0      0      0 R  98.1  0.0  12:40.30 acpi_pad/4
35067 root      -2   0       0      0      0 R  98.1  0.0  12:40.82 acpi_pad/5
35077 root      -2   0       0      0      0 R  98.1  0.0  12:39.65 acpi_pad/14
35078 root      -2   0       0      0      0 R  98.1  0.0  12:39.58 acpi_pad/15
35081 root      -2   0       0      0      0 R  98.1  0.0  12:39.32 acpi_pad/18
35072 root      -2   0       0      0      0 R  96.2  0.0  12:40.14 acpi_pad/10
35073 root      -2   0       0      0      0 R  96.2  0.0  12:39.39 acpi_pad/11
35076 root      -2   0       0      0      0 R  96.2  0.0  12:39.39 acpi_pad/13
35084 root      -2   0       0      0      0 R  96.2  0.0  12:39.06 acpi_pad/21
35092 root      -2   0       0      0      0 R  96.2  0.0  12:39.14 acpi_pad/29
35069 root      -2   0       0      0      0 R  95.2  0.0  12:40.71 acpi_pad/7
35068 root      -2   0       0      0      0 R  94.2  0.0  12:40.29 acpi_pad/6
35062 root      -2   0       0      0      0 D  93.3  0.0  12:40.56 acpi_pad/0
35064 root      -2   0       0      0      0 D  92.3  0.0  12:40.18 acpi_pad/2
35082 root      -2   0       0      0      0 R  92.3  0.0  12:39.64 acpi_pad/19
35083 root      -2   0       0      0      0 R  92.3  0.0  12:38.98 acpi_pad/20
35086 root      -2   0       0      0      0 R  92.3  0.0  12:40.11 acpi_pad/23
35088 root      -2   0       0      0      0 R  92.3  0.0  12:39.45 acpi_pad/25
35089 root      -2   0       0      0      0 R  92.3  0.0  12:39.11 acpi_pad/26
35070 root      -2   0       0      0      0 D  91.3  0.0  12:40.21 acpi_pad/8
35071 root      -2   0       0      0      0 D  91.3  0.0  12:39.98 acpi_pad/9
35091 root      -2   0       0      0      0 D  91.3  0.0  12:39.33 acpi_pad/28

perf top says:

  65.98%  [kernel]  [k] power_saving_thread
   3.27%  [kernel]  [k] native_queued_spin_lock_slowpath
   1.61%  [kernel]  [k] native_write_msr
   1.39%  [kernel]  [k] update_curr_rt
   1.20%  [kernel]  [k] intel_pstate_update_util
   1.01%  [kernel]  [k] __do_softirq
   1.01%  [kernel]  [k] ktime_get
   0.99%  [kernel]  [k] ktime_get_update_offsets_now
   0.93%  [kernel]  [k] rcu_check_callbacks
   0.90%  [kernel]  [k] _raw_spin_lock
   0.88%  [kernel]  [k] perf_event_task_tick
   0.82%  [kernel]  [k] native_irq_return_iret
   0.81%  [kernel]  [k] run_timer_softirq
   0.75%  [kernel]  [k] trigger_load_balance

No idea how to recover this, so I'm just going to reboot it. Back in
a bit.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  5:13           ` Dave Chinner
@ 2016-12-22  5:46             ` Linus Torvalds
  2016-12-22  6:50               ` Dave Chinner
  2016-12-22 20:22               ` Hugh Dickins
  2016-12-22  6:28             ` Dave Chinner
  1 sibling, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2016-12-22  5:46 UTC (permalink / raw)
  To: Dave Chinner, Johannes Weiner
  Cc: Chris Leech, Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.

By "in the dev cycle", do you mean your XFS changes, or have you been
tracking the merge cycle at least for some testing?

> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] ------------[ cut here ]------------
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220

Well, part of the changes during the merge window were the shadow
entry tracking changes that came in through Andrew's tree. Adding
Johannes Weiner to the participants.

> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

Well, I suspect that anything that creates memory pressure will end up
triggering the working set code, so ..

That said, obviously memory corruption could be involved and result in
random issues too, but I wouldn't really expect that in this code.

It would probably be really useful to get more data points - is the
problem reliably in this area, or is it going to be random and all
over the place.

That said:

> And worse, on that last error, the /host/ is now going into meltdown
> (running 4.7.5) with 32 CPUs all burning down in ACPI code:

The obvious question here is how much you trust the environment if the
host ends up also showing problems. Maybe you do end up having hw
issues pop up too.

The primary suspect would presumably be the development kernel you're
testing triggering something, but it has to be asked..

                     Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-21 23:19       ` Linus Torvalds
  2016-12-22  0:13         ` Chris Leech
@ 2016-12-22  6:18         ` Christoph Hellwig
  2016-12-22  6:30           ` Dave Chinner
  1 sibling, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-22  6:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Chris Leech, Linux Kernel Mailing List, Lee Duncan,
	open-iscsi, Linux SCSI List, linux-block, Christoph Hellwig

On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Looking around a bit, the only even halfway suspicious scatterlist
> initialization thing I see is commit f9d03f96b988 ("block: improve
> handling of the magic discard payload") which used to have a magic
> hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> now it does __blk_bvec_map_sg() instead.

But that check was only for discard (and discard-like) bios which
had the maic single page that sometimes was unused attached.

For "normal" bios the for_each_segment loop iterates over bi_vcnt,
so it will be ignored anyway.  That being said both I and the lists
got CCed halfway through the thread and I haven't seen the original
report, so I'm not really sure what's going on here anyway.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  5:13           ` Dave Chinner
  2016-12-22  5:46             ` Linus Torvalds
@ 2016-12-22  6:28             ` Dave Chinner
  2016-12-22 17:24               ` Linus Torvalds
  1 sibling, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-22  6:28 UTC (permalink / raw)
  To: Chris Leech, Linus Torvalds, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block,
	Christoph Hellwig

On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > > 
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window.  I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.....
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > > 
> > > Hmm. There's not much to go by.
> > > 
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> > 
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd.  I'm saving that off for later, and seeing if I
> > can't be a bisect on the iSCSI issue.
> 
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all. I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
> 
> [628867.607417] ------------[ cut here ]------------
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G        W       4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691]  dump_stack+0x63/0x83
> [628867.637937]  __warn+0xcb/0xf0
> [628867.641359]  warn_slowpath_null+0x1d/0x20
> [628867.643362]  shadow_lru_isolate+0x171/0x220
> [628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780]  ? __list_lru_init+0x70/0x70
> [628867.646628]  list_lru_walk_one+0x17/0x20
> [628867.647488]  scan_shadow_nodes+0x34/0x50
> [628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506]  shrink_node+0x57/0x90
> [628867.650233]  do_try_to_free_pages+0xdd/0x230
> [628867.651157]  try_to_free_pages+0xce/0x1a0
> [628867.652342]  __alloc_pages_slowpath+0x2df/0x960
> [628867.653332]  ? __might_sleep+0x4a/0x80
> [628867.654148]  __alloc_pages_nodemask+0x24b/0x290
> [628867.655237]  kmalloc_order+0x21/0x50
> [628867.656016]  kmalloc_order_trace+0x24/0xc0
> [628867.656878]  __kmalloc+0x17d/0x1d0
> [628867.657644]  bucket_table_alloc+0x195/0x1d0
> [628867.658564]  ? __might_sleep+0x4a/0x80
> [628867.659449]  rht_deferred_worker+0x287/0x3c0
> [628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294]  process_one_work+0x1de/0x4d0
> [628867.662208]  worker_thread+0x4b/0x4f0
> [628867.662990]  kthread+0x10c/0x140
> [628867.663687]  ? process_one_work+0x4d0/0x4d0
> [628867.664564]  ? kthread_create_on_node+0x40/0x40
> [628867.665523]  ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
> 
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

The system back up, and I haven't reproduced this problem yet.
However, benchmark results are way off where they should be, and at
times the performance is utterly abysmal. The XFS for-next tree
based on the 4.9 kernel shows none of these problems, so I don't
think there's an XFS problem here. Workload is the same 16-way
fsmark workload that I've been using for years as a performance
regression test.

The workload normally averages around 230k files/s - i'm seeing
and average of ~175k files/s on you current kernel. And there are
periods where performance just completely tanks:

#  ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  32  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d  /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11  -d  /mnt/scratch/12  -d  /mnt/scratch/13  -d  /mnt/scratch/14  -d  /mnt/scratch/15
#       Version 3.3, 16 thread(s) starting at Thu Dec 22 16:29:20 2016
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     256964.5         10696017
     0      3200000            0     244151.9         12129380
     0      4800000            0     239929.1         11724413
     0      6400000            0     235646.5         11998954
     0      8000000            0     162537.3         15027053
     0      9600000            0     186597.4         17957243
.....
     0     43200000            0     184972.6         27911543
     0     44800000            0     141642.2         62862142
     0     46400000            0     137700.8         39018008
     0     48000000            0      92155.0         38277234
     0     49600000            0      57053.8         27810215
     0     51200000            0     178941.6         33321543

This sort of thing is normally indicative of a memory reclaim or
lock contention problem. Profile showed unusual spinlock contention,
but then I realised there was only one kswapd thread running.
Yup, sure enough, it's caused by a major change in memory reclaim
behaviour:

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x000000083fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
[    0.000000]   node   0: [mem 0x0000000100000000-0x00000003bfffffff]
[    0.000000]   node   0: [mem 0x00000005c0000000-0x00000005ffffffff]
[    0.000000]   node   0: [mem 0x0000000800000000-0x000000083fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]

the numa=fake=4 CLI option is broken.

And, just to pile another one in - there's a massive xfs_repair
performance regression, too. It normally takes 4m30s to run over the
50 million inode filesystem created by fsmark. on 4.10:

        XFS_REPAIR Summary    Thu Dec 22 16:43:58 2016

Phase           Start           End             Duration
Phase 1:        12/22 16:35:36  12/22 16:35:38  2 seconds
Phase 2:        12/22 16:35:38  12/22 16:35:40  2 seconds
Phase 3:        12/22 16:35:40  12/22 16:39:00  3 minutes, 20 seconds
Phase 4:        12/22 16:39:00  12/22 16:40:40  1 minute, 40 seconds
Phase 5:        12/22 16:40:40  12/22 16:40:40  
Phase 6:        12/22 16:40:40  12/22 16:43:57  3 minutes, 17 seconds
Phase 7:        12/22 16:43:57  12/22 16:43:57  

Total run time: 8 minutes, 21 seconds

So, yeah, there's lots of bad stuff I'm tripping over right now.
This is all now somebody els's problem to deal with - by this time
tomorrow I'm outta here and won't be back for several months....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  6:18         ` Christoph Hellwig
@ 2016-12-22  6:30           ` Dave Chinner
  2016-12-22  6:36             ` Christoph Hellwig
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-22  6:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Chris Leech, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block

On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
> 
> But that check was only for discard (and discard-like) bios which
> had the maic single page that sometimes was unused attached.
> 
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway.  That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.

http://www.gossamer-threads.com/lists/linux/kernel/2587485

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  6:30           ` Dave Chinner
@ 2016-12-22  6:36             ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-22  6:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Linus Torvalds, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block

On Thu, Dec 22, 2016 at 05:30:46PM +1100, Dave Chinner wrote:
> > For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> > so it will be ignored anyway.  That being said both I and the lists
> > got CCed halfway through the thread and I haven't seen the original
> > report, so I'm not really sure what's going on here anyway.
> 
> http://www.gossamer-threads.com/lists/linux/kernel/2587485

This doesn't look like the discard changes, but if Chris wants to test
without them f9d03f96b988 reverts cleanly.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  5:46             ` Linus Torvalds
@ 2016-12-22  6:50               ` Dave Chinner
  2016-12-22 18:50                 ` Chris Leech
  2016-12-22 20:22               ` Hugh Dickins
  1 sibling, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-22  6:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Weiner, Chris Leech, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block,
	Christoph Hellwig

On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.

The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been bust with other stuff until now, so I didn't
upgrade my working trees again until today in the hope the iscsi
problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] ------------[ cut here ]------------
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducable. create a pair of iscsi luns,
mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
the filesystems.

The test machine I'm having all these other problems on? stable and
steady as a rock using PMEM devices. Moment I go to use /dev/vdc
(i.e. run load/perf benchmarks) it starts falling over left, right
and center.

And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat  pfn:14945
[ 2729.751863] page:ffffea0000525140 count:-1 mapcount:0 mapping:          (null) index:0x0
[ 2729.753763] flags: 0x4000000000000000()
[ 2729.754671] raw: 4000000000000000 0000000000000000 0000000000000000 ffffffffffffffff
[ 2729.756469] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: G    B           4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]  <IRQ>
[ 2729.765004]  dump_stack+0x63/0x83
[ 2729.765784]  bad_page+0xc4/0x130
[ 2729.766552]  free_pages_check_bad+0x4f/0x70
[ 2729.767531]  free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513]  ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510]  drain_pages_zone+0x41/0x60
[ 2729.770417]  drain_pages+0x3e/0x60
[ 2729.771215]  drain_local_pages+0x24/0x30
[ 2729.772138]  flush_smp_call_function_queue+0x88/0x160
[ 2729.773317]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742]  smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000]  smp_call_function_interrupt+0xe/0x10
[ 2729.777102]  call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:ffffc9000f0cf500 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff03
[ 2729.780852] RAX: 0000000077541291 RBX: ffff88008b5efe40 RCX: 000000000000002e
[ 2729.782514] RDX: 0000057700000000 RSI: 0000057777541291 RDI: 0000000000000001
[ 2729.784167] RBP: ffffc9000f0cf500 R08: 0000000000000007 R09: ffffc9000f0cf678
[ 2729.785818] R10: 0000000000000006 R11: 0000000000001000 R12: 0000000000000061
[ 2729.787480] R13: 0000000000000001 R14: 0000000083214e30 R15: 0000000000000080
[ 2729.789124]  </IRQ>
[ 2729.789626]  __delay+0xf/0x20
[ 2729.790333]  do_raw_spin_lock+0x8c/0x160
[ 2729.791255]  _raw_spin_lock+0x15/0x20
[ 2729.792112]  list_lru_add+0x1a/0x70
[ 2729.792932]  xfs_buf_rele+0x3e7/0x410
[ 2729.793792]  xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099]  shrink_node+0x57/0x90
[ 2729.796905]  do_try_to_free_pages+0xdd/0x230
[ 2729.797914]  try_to_free_pages+0xce/0x1a0
[ 2729.798852]  __alloc_pages_slowpath+0x2df/0x960
[ 2729.799908]  __alloc_pages_nodemask+0x24b/0x290
[ 2729.800963]  new_slab+0x2ac/0x380
[ 2729.801743]  ___slab_alloc.constprop.82+0x336/0x440
[ 2729.802890]  ? kmem_zone_alloc+0x81/0x120
[ 2729.803830]  ? xfs_buf_read_map+0x2d/0x1a0
[ 2729.804785]  ? kmem_zone_alloc+0x81/0x120
[ 2729.805723]  __slab_alloc.isra.78.constprop.81+0x2b/0x40
[ 2729.806970]  kmem_cache_alloc+0x142/0x180
[ 2729.807905]  kmem_zone_alloc+0x81/0x120
[ 2729.808807]  xfs_inode_alloc+0x29/0x1d0
[ 2729.809707]  xfs_iget+0x3cd/0x9d0
[ 2729.810496]  ? xfs_iflush+0x350/0x350
[ 2729.811358]  xfs_bulkstat_one_int+0x87/0x370
[ 2729.812359]  xfs_bulkstat_one+0x20/0x30
[ 2729.813259]  xfs_bulkstat+0x485/0x690
[ 2729.814125]  ? xfs_bulkstat_one_int+0x370/0x370
[ 2729.815183]  xfs_ioc_bulkstat+0xfa/0x1d0
[ 2729.816105]  xfs_file_ioctl+0x8e4/0xc70
[ 2729.817006]  ? _raw_spin_unlock_irq+0xe/0x30
[ 2729.818011]  ? finish_task_switch+0xab/0x240
[ 2729.819007]  ? __schedule+0x265/0x760
[ 2729.819866]  ? _raw_spin_unlock+0xe/0x20
[ 2729.820783]  ? preempt_schedule_irq+0x59/0xb0
[ 2729.821797]  ? retint_kernel+0x1b/0x1d
[ 2729.822685]  do_vfs_ioctl+0x94/0x700
[ 2729.823523]  ? __fget+0x77/0xb0
[ 2729.824263]  SyS_ioctl+0x79/0x90
[ 2729.825022]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 2729.826105] RIP: 0033:0x7fd25ff088e7
[ 2729.826943] RSP: 002b:00007fcfc4757e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2729.828714] RAX: ffffffffffffffda RBX: 00007fd0d4000020 RCX: 00007fd25ff088e7
[ 2729.830372] RDX: 00007fcfc4757e60 RSI: ffffffffc0205865 RDI: 0000000000000003
[ 2729.832020] RBP: 0000000000022010 R08: ffffffffffffffff R09: 0000000000000400
[ 2729.833670] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000023000
[ 2729.835327] R13: 0000000000022010 R14: 0000000000001000 R15: 00007fd0d4000020

> That said:
> 
> > And worse, on that last error, the /host/ is now going into meltdown
> > (running 4.7.5) with 32 CPUs all burning down in ACPI code:
> 
> The obvious question here is how much you trust the environment if the
> host ends up also showing problems. Maybe you do end up having hw
> issues pop up too.

I trust the hardware way more than I trust the software running on
it....

> The primary suspect would presumably be the development kernel you're
> testing triggering something, but it has to be asked..

I've seen a couple of guest panics in RCU related code take down the
parent QEMU process recently - asserts in console handling code IIRC
- but that's the only signs I've ever had of host side problems.
That both the host and guest went whacky in new and interesting ways
at exactly the same time might be coincidence, but Occam's Razor
suggests they are linked in some way...


> 
>                      Linus
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  6:28             ` Dave Chinner
@ 2016-12-22 17:24               ` Linus Torvalds
  2016-12-22 20:20                 ` Thomas Gleixner
  2016-12-22 20:42                 ` Dave Chinner
  0 siblings, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2016-12-22 17:24 UTC (permalink / raw)
  To: Dave Chinner, Thomas Gleixner, Ingo Molnar, Peter Anvin
  Cc: Linux Kernel Mailing List, the arch/x86 maintainers

On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> This sort of thing is normally indicative of a memory reclaim or
> lock contention problem. Profile showed unusual spinlock contention,
> but then I realised there was only one kswapd thread running.
> Yup, sure enough, it's caused by a major change in memory reclaim
> behaviour:
>
> [    0.000000] Zone ranges:
> [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> [    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.000000]   Normal   [mem 0x0000000100000000-0x000000083fffffff]
> [    0.000000] Movable zone start for each node
> [    0.000000] Early memory node ranges
> [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> [    0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
> [    0.000000]   node   0: [mem 0x0000000100000000-0x00000003bfffffff]
> [    0.000000]   node   0: [mem 0x00000005c0000000-0x00000005ffffffff]
> [    0.000000]   node   0: [mem 0x0000000800000000-0x000000083fffffff]
> [    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
>
> the numa=fake=4 CLI option is broken.

Ok, I think that is independent of anything else. Removing block
people and adding the x86 people.

I'm not seeing anything at all that would change the fake numa stuff,
but maybe the cpu hotplug changes?

Thomas/Ingo/Peter - Dave is going away for several months, so you
won't get feedback from him, but can you look at this? Or maybe point
me towards the right people - I'm seeing no possible relevant changes
at all fir x85 numa since 4.9, so it must be some indirect breakage.

Dave is using fake-numa to do performance testing in a VM, and it's a
big deal for the node optimizations for writeback etc. Do you have any
ideas?

Dave, if you're still around, can you send out the kernel config file
you used...

               Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  6:50               ` Dave Chinner
@ 2016-12-22 18:50                 ` Chris Leech
  2016-12-22 23:53                   ` Ming Lei
  0 siblings, 1 reply; 57+ messages in thread
From: Chris Leech @ 2016-12-22 18:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Johannes Weiner, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block,
	Christoph Hellwig

On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > There may be deeper issues. I just started running scalability tests
> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > > directory corruption reported - something I hadn't seen in the dev
> > > cycle at all.
> > 
> > By "in the dev cycle", do you mean your XFS changes, or have you been
> > tracking the merge cycle at least for some testing?
> 
> I mean the three months leading up to the 4.10 merge, when all the
> XFS changes were being tested against 4.9-rc kernels.
> 
> The iscsi problem showed up when I updated the base kernel from
> 4.9 to 4.10-current last week to test the pullreq I was going to
> send you. I've been bust with other stuff until now, so I didn't
> upgrade my working trees again until today in the hope the iscsi
> problem had already been found and fixed.
> 
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] ------------[ cut here ]------------
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > 
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> > 
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> > 
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> > 
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> > 
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
> 
> The iscsi problem is 100% reproducable. create a pair of iscsi luns,
> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
> the filesystems.
> 
> The test machine I'm having all these other problems on? stable and
> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
> (i.e. run load/perf benchmarks) it starts falling over left, right
> and center.

I'm not reproducing any problems with xfstests running over iscsi_tcp
right now.  Two 10G luns exported from an LIO target, attached directly
to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
TEST_DEV and SCRATCH_DEV.

The virtio scatterlist issue that popped right away for me is triggered
by an hdparm ioctl, which is being run by tuned on Fedora.  And that
actually seems to happen back on 4.9 as well :(

Chris


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 17:24               ` Linus Torvalds
@ 2016-12-22 20:20                 ` Thomas Gleixner
  2016-12-22 20:42                 ` Dave Chinner
  1 sibling, 0 replies; 57+ messages in thread
From: Thomas Gleixner @ 2016-12-22 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

On Thu, 22 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [    0.000000] Zone ranges:
> > [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > [    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> > [    0.000000]   Normal   [mem 0x0000000100000000-0x000000083fffffff]
> > [    0.000000] Movable zone start for each node
> > [    0.000000] Early memory node ranges
> > [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > [    0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
> > [    0.000000]   node   0: [mem 0x0000000100000000-0x00000003bfffffff]
> > [    0.000000]   node   0: [mem 0x00000005c0000000-0x00000005ffffffff]
> > [    0.000000]   node   0: [mem 0x0000000800000000-0x000000083fffffff]
> > [    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
> >
> > the numa=fake=4 CLI option is broken.
> 
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
> 
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?

Nope. Double checked it for correctness. The cpu hotplug code there is not
involved in setting up kswapd threads. They are created by the memory node
stuff.

> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all fir x85 numa since 4.9, so it must be some indirect breakage.
> 
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?

There was the nodeid -> cpuid mapping stuff, but that was in 4.9 so it's
not a post 4.9 wreckage, but maybe it's related nevertheless.

> Dave, if you're still around, can you send out the kernel config file
> you used...

Config file and kernel command line would be helpful.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22  5:46             ` Linus Torvalds
  2016-12-22  6:50               ` Dave Chinner
@ 2016-12-22 20:22               ` Hugh Dickins
  2016-12-23  7:32                 ` Johannes Weiner
  1 sibling, 1 reply; 57+ messages in thread
From: Hugh Dickins @ 2016-12-22 20:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Johannes Weiner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

On Wed, 21 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?
> 
> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] ------------[ cut here ]------------
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
soon followed by NULL pointer deref in list_lru_isolate, one time when
I tried out Sunday's git tree.  Not seen since, I haven't had time to
investigate, just set it aside as something to worry about if it happens
again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
case (I've no XFS and no iscsi), suspect unrelated to his other problems.

Hugh

> 
> That said:
> 
> > And worse, on that last error, the /host/ is now going into meltdown
> > (running 4.7.5) with 32 CPUs all burning down in ACPI code:
> 
> The obvious question here is how much you trust the environment if the
> host ends up also showing problems. Maybe you do end up having hw
> issues pop up too.
> 
> The primary suspect would presumably be the development kernel you're
> testing triggering something, but it has to be asked..
> 
>                      Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 17:24               ` Linus Torvalds
  2016-12-22 20:20                 ` Thomas Gleixner
@ 2016-12-22 20:42                 ` Dave Chinner
  2016-12-22 21:06                   ` Dave Chinner
  1 sibling, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-22 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [    0.000000] Zone ranges:
> > [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > [    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> > [    0.000000]   Normal   [mem 0x0000000100000000-0x000000083fffffff]
> > [    0.000000] Movable zone start for each node
> > [    0.000000] Early memory node ranges
> > [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > [    0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
> > [    0.000000]   node   0: [mem 0x0000000100000000-0x00000003bfffffff]
> > [    0.000000]   node   0: [mem 0x00000005c0000000-0x00000005ffffffff]
> > [    0.000000]   node   0: [mem 0x0000000800000000-0x000000083fffffff]
> > [    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
> >
> > the numa=fake=4 CLI option is broken.
> 
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
> 
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?
> 
> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all fir x85 numa since 4.9, so it must be some indirect breakage.
> 
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?
> 
> Dave, if you're still around, can you send out the kernel config file
> you used...

Looking at this fresh this morning (i.e. not pissed off by having
everything I tried to do fail in different ways all afternoon) I
found this:

$ grep NUMA .config
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
# CONFIG_NUMA is not set
$

The .config I was using for 4.9 got 'make oldconfig' upgraded, and
looking at it there's a bunch of stuff that has been turned off that
I know was set:

# CONFIG_EXPERT is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
# CONFIG_COMPACTION is not set

and stuff I never use so don't set was set, like kernel crash dump,
a bunch of stuff for AMD CPUs, susp/resume and power management
debug, every partition type and filesystem under the sun was
selected, heaps of network devices enabled, etc.

So it looks like the problem has occurred during oldconfig, meaning
I have no idea exactly WTF I was testing. Rebuilding now with a
saner config, see what happens.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 20:42                 ` Dave Chinner
@ 2016-12-22 21:06                   ` Dave Chinner
  2016-12-22 21:10                     ` Linus Torvalds
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-22 21:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

On Fri, Dec 23, 2016 at 07:42:40AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > This sort of thing is normally indicative of a memory reclaim or
> > > lock contention problem. Profile showed unusual spinlock contention,
> > > but then I realised there was only one kswapd thread running.
> > > Yup, sure enough, it's caused by a major change in memory reclaim
> > > behaviour:
> > >
> > > [    0.000000] Zone ranges:
> > > [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > > [    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> > > [    0.000000]   Normal   [mem 0x0000000100000000-0x000000083fffffff]
> > > [    0.000000] Movable zone start for each node
> > > [    0.000000] Early memory node ranges
> > > [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> > > [    0.000000]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
> > > [    0.000000]   node   0: [mem 0x0000000100000000-0x00000003bfffffff]
> > > [    0.000000]   node   0: [mem 0x00000005c0000000-0x00000005ffffffff]
> > > [    0.000000]   node   0: [mem 0x0000000800000000-0x000000083fffffff]
> > > [    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
> > >
> > > the numa=fake=4 CLI option is broken.
> > 
> > Ok, I think that is independent of anything else. Removing block
> > people and adding the x86 people.
> > 
> > I'm not seeing anything at all that would change the fake numa stuff,
> > but maybe the cpu hotplug changes?
> > 
> > Thomas/Ingo/Peter - Dave is going away for several months, so you
> > won't get feedback from him, but can you look at this? Or maybe point
> > me towards the right people - I'm seeing no possible relevant changes
> > at all fir x85 numa since 4.9, so it must be some indirect breakage.
> > 
> > Dave is using fake-numa to do performance testing in a VM, and it's a
> > big deal for the node optimizations for writeback etc. Do you have any
> > ideas?
> > 
> > Dave, if you're still around, can you send out the kernel config file
> > you used...
> 
> Looking at this fresh this morning (i.e. not pissed off by having
> everything I tried to do fail in different ways all afternoon) I
> found this:
> 
> $ grep NUMA .config
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> # CONFIG_NUMA is not set
> $
> 
> The .config I was using for 4.9 got 'make oldconfig' upgraded, and
> looking at it there's a bunch of stuff that has been turned off that
> I know was set:
> 
> # CONFIG_EXPERT is not set
> # CONFIG_PARAVIRT_SPINLOCKS is not set
> # CONFIG_COMPACTION is not set
> 
> and stuff I never use so don't set was set, like kernel crash dump,
> a bunch of stuff for AMD CPUs, susp/resume and power management
> debug, every partition type and filesystem under the sun was
> selected, heaps of network devices enabled, etc.
> 
> So it looks like the problem has occurred during oldconfig, meaning
> I have no idea exactly WTF I was testing. Rebuilding now with a
> saner config, see what happens.

Better, but still bad. average files/s is not up to 200k files/s,
so still a good 10-15% off where it should be. xfs_repair is back
down to 10-15% off where it should be, too. bulkstat still fires off
a bad page reference count warning, iscsi still panics immediately.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 21:06                   ` Dave Chinner
@ 2016-12-22 21:10                     ` Linus Torvalds
  2016-12-22 22:15                       ` Dave Chinner
  2016-12-23  0:16                       ` Jens Axboe
  0 siblings, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2016-12-22 21:10 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

Ok, so the numa issue was a red herring. With that fixed:

On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Better, but still bad. average files/s is not up to 200k files/s,
> so still a good 10-15% off where it should be. xfs_repair is back
> down to 10-15% off where it should be, too. bulkstat still fires off
> a bad page reference count warning, iscsi still panics immediately.

Do you have CONFIG_BLK_WBT enabled, perhaps?

It's new to this merge window, and I'm not convinced it's been tuned.
Particularly for your kinds of fairly extreme IO loads.

              Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 21:10                     ` Linus Torvalds
@ 2016-12-22 22:15                       ` Dave Chinner
  2016-12-22 22:33                         ` Dave Chinner
  2016-12-23  0:16                       ` Jens Axboe
  1 sibling, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-22 22:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
> 
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Better, but still bad. average files/s is not up to 200k files/s,
> > so still a good 10-15% off where it should be. xfs_repair is back
> > down to 10-15% off where it should be, too. bulkstat still fires off
> > a bad page reference count warning, iscsi still panics immediately.
> 
> Do you have CONFIG_BLK_WBT enabled, perhaps?

Ok, yes, that's enabled. Let me go turn it off and see what happens.

> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.

Well, these aren't extreme IO workloads. Yes, the filesystem does a
lot of work on the CPU, but the IO loads XFS generates aren't
particularly high - maybe 2-3000 IOPS at most, and mostly the
bandwidth is below 100MB/s.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 22:15                       ` Dave Chinner
@ 2016-12-22 22:33                         ` Dave Chinner
  2016-12-23  3:52                           ` Dave Chinner
  0 siblings, 1 reply; 57+ messages in thread
From: Dave Chinner @ 2016-12-22 22:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > Ok, so the numa issue was a red herring. With that fixed:
> > 
> > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Better, but still bad. average files/s is not up to 200k files/s,
> > > so still a good 10-15% off where it should be. xfs_repair is back
> > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > a bad page reference count warning, iscsi still panics immediately.
> > 
> > Do you have CONFIG_BLK_WBT enabled, perhaps?
> 
> Ok, yes, that's enabled. Let me go turn it off and see what happens.

Numbers are still all over the place.

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0     28800000            0     228175.5         21928812
     0     30400000            0     167880.5         39628229
     0     32000000            0     124289.5         41420925
     0     33600000            0     150577.9         35382318
     0     35200000            0     216535.4         16072628
     0     36800000            0     233414.4         11846654
     0     38400000            0     213812.0         13356633
     0     40000000            0     175905.7         53012015
     0     41600000            0     157028.7         34700794
     0     43200000            0     138829.1         50282461

And the average is now back down to 185k files/s. repair runtime is
unchanged and still 10-15% off...

I've got to run away for a few hours right now, but I'll retest the
4.9 + xfs for-next branch when I get back to see if the problem is
my curent config or whether there really is a perf problem lurking
somewhere....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 18:50                 ` Chris Leech
@ 2016-12-22 23:53                   ` Ming Lei
  2016-12-23  0:03                       ` Chris Leech
  0 siblings, 1 reply; 57+ messages in thread
From: Ming Lei @ 2016-12-22 23:53 UTC (permalink / raw)
  To: Chris Leech, Dave Chinner, Linus Torvalds, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech <cleech@redhat.com> wrote:
> On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
>> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
>> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > >
>> > > There may be deeper issues. I just started running scalability tests
>> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
>> > > directory corruption reported - something I hadn't seen in the dev
>> > > cycle at all.
>> >
>> > By "in the dev cycle", do you mean your XFS changes, or have you been
>> > tracking the merge cycle at least for some testing?
>>
>> I mean the three months leading up to the 4.10 merge, when all the
>> XFS changes were being tested against 4.9-rc kernels.
>>
>> The iscsi problem showed up when I updated the base kernel from
>> 4.9 to 4.10-current last week to test the pullreq I was going to
>> send you. I've been bust with other stuff until now, so I didn't
>> upgrade my working trees again until today in the hope the iscsi
>> problem had already been found and fixed.
>>
>> > > I unmounted the fs, mkfs'd it again, ran the
>> > > workload again and about a minute in this fired:
>> > >
>> > > [628867.607417] ------------[ cut here ]------------
>> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
>> >
>> > Well, part of the changes during the merge window were the shadow
>> > entry tracking changes that came in through Andrew's tree. Adding
>> > Johannes Weiner to the participants.
>> >
>> > > Now, this workload does not touch the page cache at all - it's
>> > > entirely an XFS metadata workload, so it should not really be
>> > > affecting the working set code.
>> >
>> > Well, I suspect that anything that creates memory pressure will end up
>> > triggering the working set code, so ..
>> >
>> > That said, obviously memory corruption could be involved and result in
>> > random issues too, but I wouldn't really expect that in this code.
>> >
>> > It would probably be really useful to get more data points - is the
>> > problem reliably in this area, or is it going to be random and all
>> > over the place.
>>
>> The iscsi problem is 100% reproducable. create a pair of iscsi luns,
>> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
>> the filesystems.
>>
>> The test machine I'm having all these other problems on? stable and
>> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
>> (i.e. run load/perf benchmarks) it starts falling over left, right
>> and center.
>
> I'm not reproducing any problems with xfstests running over iscsi_tcp
> right now.  Two 10G luns exported from an LIO target, attached directly
> to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> TEST_DEV and SCRATCH_DEV.
>
> The virtio scatterlist issue that popped right away for me is triggered
> by an hdparm ioctl, which is being run by tuned on Fedora.  And that
> actually seems to happen back on 4.9 as well :(

Could you share us what the specific hdparm cmd line is? I tried several
random cmds over virtio-blk/virito-scsi, looks not see this problem.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-23  0:03                       ` Chris Leech
  0 siblings, 0 replies; 57+ messages in thread
From: Chris Leech @ 2016-12-23  0:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Dave Chinner, Linus Torvalds, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

On Fri, Dec 23, 2016 at 07:53:50AM +0800, Ming Lei wrote:
> On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech <cleech@redhat.com> wrote:
> > I'm not reproducing any problems with xfstests running over iscsi_tcp
> > right now.  Two 10G luns exported from an LIO target, attached directly
> > to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> > TEST_DEV and SCRATCH_DEV.
> >
> > The virtio scatterlist issue that popped right away for me is triggered
> > by an hdparm ioctl, which is being run by tuned on Fedora.  And that
> > actually seems to happen back on 4.9 as well :(
> 
> Could you share us what the specific hdparm cmd line is? I tried several
> random cmds over virtio-blk/virito-scsi, looks not see this problem.

Of course, looks like I've screwed up my bisect run on this so I'm still
taking a look.  It triggers for me with 'hdparm -B /dev/vda' but may
also depend on kernel configuration.

I started with the fedora rawhide config with a lot of debug on and
trimed it down with a localmodconfig run in the VM to speed up rebuilds.

Chris

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-23  0:03                       ` Chris Leech
  0 siblings, 0 replies; 57+ messages in thread
From: Chris Leech @ 2016-12-23  0:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Dave Chinner, Linus Torvalds, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List, linux-block,
	Christoph Hellwig

On Fri, Dec 23, 2016 at 07:53:50AM +0800, Ming Lei wrote:
> On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech <cleech-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > I'm not reproducing any problems with xfstests running over iscsi_tcp
> > right now.  Two 10G luns exported from an LIO target, attached directly
> > to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> > TEST_DEV and SCRATCH_DEV.
> >
> > The virtio scatterlist issue that popped right away for me is triggered
> > by an hdparm ioctl, which is being run by tuned on Fedora.  And that
> > actually seems to happen back on 4.9 as well :(
> 
> Could you share us what the specific hdparm cmd line is? I tried several
> random cmds over virtio-blk/virito-scsi, looks not see this problem.

Of course, looks like I've screwed up my bisect run on this so I'm still
taking a look.  It triggers for me with 'hdparm -B /dev/vda' but may
also depend on kernel configuration.

I started with the fedora rawhide config with a lot of debug on and
trimed it down with a localmodconfig run in the VM to speed up rebuilds.

Chris

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 21:10                     ` Linus Torvalds
  2016-12-22 22:15                       ` Dave Chinner
@ 2016-12-23  0:16                       ` Jens Axboe
  1 sibling, 0 replies; 57+ messages in thread
From: Jens Axboe @ 2016-12-23  0:16 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner
  Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

On 12/22/2016 02:10 PM, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
> 
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Better, but still bad. average files/s is not up to 200k files/s,
>> so still a good 10-15% off where it should be. xfs_repair is back
>> down to 10-15% off where it should be, too. bulkstat still fires off
>> a bad page reference count warning, iscsi still panics immediately.
> 
> Do you have CONFIG_BLK_WBT enabled, perhaps?
> 
> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.

We have it running on thousands of machines at Facebook, and I've used
it in all of my testing, haven't seen any issues. It's been run pretty
damn hard.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 22:33                         ` Dave Chinner
@ 2016-12-23  3:52                           ` Dave Chinner
  0 siblings, 0 replies; 57+ messages in thread
From: Dave Chinner @ 2016-12-23  3:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Thomas Gleixner, Ingo Molnar, Peter Anvin,
	Linux Kernel Mailing List, the arch/x86 maintainers

On Fri, Dec 23, 2016 at 09:33:36AM +1100, Dave Chinner wrote:
> On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> > On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > > Ok, so the numa issue was a red herring. With that fixed:
> > > 
> > > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > Better, but still bad. average files/s is not up to 200k files/s,
> > > > so still a good 10-15% off where it should be. xfs_repair is back
> > > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > > a bad page reference count warning, iscsi still panics immediately.
> > > 
> > > Do you have CONFIG_BLK_WBT enabled, perhaps?
> > 
> > Ok, yes, that's enabled. Let me go turn it off and see what happens.
> 
> Numbers are still all over the place.
> 
> FSUse%        Count         Size    Files/sec     App Overhead
> .....
>      0     28800000            0     228175.5         21928812
>      0     30400000            0     167880.5         39628229
>      0     32000000            0     124289.5         41420925
>      0     33600000            0     150577.9         35382318
>      0     35200000            0     216535.4         16072628
>      0     36800000            0     233414.4         11846654
>      0     38400000            0     213812.0         13356633
>      0     40000000            0     175905.7         53012015
>      0     41600000            0     157028.7         34700794
>      0     43200000            0     138829.1         50282461
> 
> And the average is now back down to 185k files/s. repair runtime is
> unchanged and still 10-15% off...
> 
> I've got to run away for a few hours right now, but I'll retest the
> 4.9 + xfs for-next branch when I get back to see if the problem is
> my curent config or whether there really is a perf problem lurking
> somewhere....

Well, I'm not sure now. Taking that config back to 4.9 gave
results of 210k files/s. A bit faster, but still not the 230k
files/s I'm expecting. So I'm missing something in the .config
at this point, though I'm not sure what.

FWIW, updating from 4.9 to to the 4.10 tree, this happened:

$ $ make oldconfig; make -j 32
scripts/kconfig/conf  --oldconfig Kconfig
*
* Restart config...
*
*
* General setup
*

Yup, it definitely did something unexpected. And almost silently, I
might add - I didn't notice this the first time around, and wouldn't
hav enoticed it this time if I wasn't looking for something strange
to happen.

As iit is, I still haven't found what magic config option is
taking away that 10-15% of performance.  There's no unexpected debug
options set, and it's nothing obvious in the fs/block layer config.
I'll keep looking for the moment...

Cheers,

Dave.


-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-22 20:22               ` Hugh Dickins
@ 2016-12-23  7:32                 ` Johannes Weiner
  2016-12-23  8:33                   ` Johannes Weiner
  0 siblings, 1 reply; 57+ messages in thread
From: Johannes Weiner @ 2016-12-23  7:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] ------------[ cut here ]------------
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > 
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> > 
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> > 
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> > 
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> > 
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
> 
> Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> soon followed by NULL pointer deref in list_lru_isolate, one time when
> I tried out Sunday's git tree.  Not seen since, I haven't had time to
> investigate, just set it aside as something to worry about if it happens
> again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> case (I've no XFS and no iscsi), suspect unrelated to his other problems.

This seems consistent with what Dave observed: we encounter regular
pages in radix tree nodes on the shadow LRU that should only contain
nodes full of exceptional shadow entries. It could be an issue in the
new slot replacement code and the node tracking callback.

Looking...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-23  7:32                 ` Johannes Weiner
@ 2016-12-23  8:33                   ` Johannes Weiner
  2017-01-02 21:11                       ` Johannes Weiner
  0 siblings, 1 reply; 57+ messages in thread
From: Johannes Weiner @ 2016-12-23  8:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig

On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > I unmounted the fs, mkfs'd it again, ran the
> > > > workload again and about a minute in this fired:
> > > >
> > > > [628867.607417] ------------[ cut here ]------------
> > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > 
> > > Well, part of the changes during the merge window were the shadow
> > > entry tracking changes that came in through Andrew's tree. Adding
> > > Johannes Weiner to the participants.
> > > 
> > > > Now, this workload does not touch the page cache at all - it's
> > > > entirely an XFS metadata workload, so it should not really be
> > > > affecting the working set code.
> > > 
> > > Well, I suspect that anything that creates memory pressure will end up
> > > triggering the working set code, so ..
> > > 
> > > That said, obviously memory corruption could be involved and result in
> > > random issues too, but I wouldn't really expect that in this code.
> > > 
> > > It would probably be really useful to get more data points - is the
> > > problem reliably in this area, or is it going to be random and all
> > > over the place.
> > 
> > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > investigate, just set it aside as something to worry about if it happens
> > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> 
> This seems consistent with what Dave observed: we encounter regular
> pages in radix tree nodes on the shadow LRU that should only contain
> nodes full of exceptional shadow entries. It could be an issue in the
> new slot replacement code and the node tracking callback.

Both encounters seem to indicate use-after-free. Dave's node didn't
warn about an unexpected node->count / node->exceptional state, but
had entries that were inconsistent with that. Hugh got the counter
warning but crashed on a list_head that's not NULLed in a live node.

workingset_update_node() should be called on page cache radix tree
leaf nodes that go empty. I must be missing an update_node callback
where a leaf node gets freed somewhere.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-23  0:03                       ` Chris Leech
  (?)
@ 2016-12-23 10:00                       ` Christoph Hellwig
  2016-12-23 19:42                         ` Linus Torvalds
  -1 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-23 10:00 UTC (permalink / raw)
  To: Chris Leech, Ming Lei, Dave Chinner, Linus Torvalds,
	Johannes Weiner, Linux Kernel Mailing List, Lee Duncan,
	open-iscsi, Linux SCSI List, linux-block, axboe, mst

On Thu, Dec 22, 2016 at 04:03:56PM -0800, Chris Leech wrote:
> Of course, looks like I've screwed up my bisect run on this so I'm still
> taking a look.  It triggers for me with 'hdparm -B /dev/vda' but may
> also depend on kernel configuration.
> 
> I started with the fedora rawhide config with a lot of debug on and
> trimed it down with a localmodconfig run in the VM to speed up rebuilds.

I think the configuration dependency is CONFIG_HAVE_ARCH_VMAP_STACK,
I've just reproduce the issue with it, and the backtrace points to
__virtblk_add_req when setting up the sense buffer.   And it turns out
that blk_execute_rq tries to do I/O to the on-stack sense buffer.
At least SCSI always has a kmalloced sense buffer, so I guess we'll
need something similar for virtio_blk for now.  For 4.11 I plan to
rework how BLOCK_PC commands work entirely, so hopefull we can make
the sense buffer handling a lot less wasteful.

---
>From 0a77bc424ed907c1e99b4756bb498370b498183a Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Fri, 23 Dec 2016 10:57:06 +0100
Subject: virtio_blk: avoid DMA to stack for the sense buffer

Most users of BLOCK_PC requests allocate the sense buffer on the stack,
so to avoid DMA to the stack copy them to a field in the heap allocated
virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
including the SG_IO ioctl from userspace will crash the kernel.  Note that
this includes running tools like hdparm even when the host does not have
SCSI passthrough enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/virtio_blk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5545a67..3c3b8f6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -56,6 +56,7 @@ struct virtblk_req {
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
 	u8 status;
+	u8 sense[SCSI_SENSE_BUFFERSIZE];
 	struct scatterlist sg[];
 };
 
@@ -102,7 +103,8 @@ static int __virtblk_add_req(struct virtqueue *vq,
 	}
 
 	if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
-		sg_init_one(&sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+		memcpy(vbr->sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+		sg_init_one(&sense, vbr->sense, SCSI_SENSE_BUFFERSIZE);
 		sgs[num_out + num_in++] = &sense;
 		sg_init_one(&inhdr, &vbr->in_hdr, sizeof(vbr->in_hdr));
 		sgs[num_out + num_in++] = &inhdr;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-23 10:00                       ` Christoph Hellwig
@ 2016-12-23 19:42                         ` Linus Torvalds
  2016-12-24  2:45                             ` Jens Axboe
  2016-12-24 10:07                             ` Christoph Hellwig
  0 siblings, 2 replies; 57+ messages in thread
From: Linus Torvalds @ 2016-12-23 19:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Leech, Ming Lei, Dave Chinner, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Jens Axboe, Michael S. Tsirkin

On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig <hch@lst.de> wrote:
>
> From: Christoph Hellwig <hch@lst.de>
> Date: Fri, 23 Dec 2016 10:57:06 +0100
> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>
> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
> so to avoid DMA to the stack copy them to a field in the heap allocated
> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
> including the SG_IO ioctl from userspace will crash the kernel.  Note that
> this includes running tools like hdparm even when the host does not have
> SCSI passthrough enabled.

Ugh. This patch is nasty.

I think we should just fix blk_execute_rq() instead.

But from a quick look, we also have at least sg_scsi_ioctl() and
sg_io() doing the same thing.

And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c

And resp_requests() in scsi_debug.c.

So I guess ugly it may need to be, and the rule is that the sense
buffer really can be on the stack and you can't DMA to/from it.
Comments from others?

                Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-23 19:42                         ` Linus Torvalds
@ 2016-12-24  2:45                             ` Jens Axboe
  2016-12-24 10:07                             ` Christoph Hellwig
  1 sibling, 0 replies; 57+ messages in thread
From: Jens Axboe @ 2016-12-24  2:45 UTC (permalink / raw)
  To: Linus Torvalds, Christoph Hellwig
  Cc: Chris Leech, Ming Lei, Dave Chinner, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Michael S. Tsirkin

On 12/23/2016 12:42 PM, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig <hch@lst.de> wrote:
>>
>> From: Christoph Hellwig <hch@lst.de>
>> Date: Fri, 23 Dec 2016 10:57:06 +0100
>> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>>
>> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
>> so to avoid DMA to the stack copy them to a field in the heap allocated
>> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
>> including the SG_IO ioctl from userspace will crash the kernel.  Note that
>> this includes running tools like hdparm even when the host does not have
>> SCSI passthrough enabled.
> 
> Ugh. This patch is nasty.
> 
> I think we should just fix blk_execute_rq() instead.
> 
> But from a quick look, we also have at least sg_scsi_ioctl() and
> sg_io() doing the same thing.
> 
> And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c
> 
> And resp_requests() in scsi_debug.c.

It's not that it's technically hard to fix up, it's more that it's a
pain in the ass to have to do it. For instance, for blk_execute_rq(), we
either should enforce that the caller allocates it dynamically and then
free it, or we need nasty hack where the caller needs to know he has to
free it. Pretty obvious what I would prefer there.

And yes, there would be a good chunk of other places where this would
nede to be fixed up...

> So I guess ugly it may need to be, and the rule is that the sense
> buffer really can be on the stack and you can't DMA to/from it.
> Comments from others?

I'm just wondering why this is being hit now, we have a 4.9 release with
this issue and nobody reported it (that I saw)... Which is pretty sad.

If no one beats me to it, I'll try and get a patch done on Sunday. We're
in the midst of the holidays here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24  2:45                             ` Jens Axboe
  0 siblings, 0 replies; 57+ messages in thread
From: Jens Axboe @ 2016-12-24  2:45 UTC (permalink / raw)
  To: Linus Torvalds, Christoph Hellwig
  Cc: Chris Leech, Ming Lei, Dave Chinner, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Michael S. Tsirkin

On 12/23/2016 12:42 PM, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig <hch@lst.de> wrote:
>>
>> From: Christoph Hellwig <hch@lst.de>
>> Date: Fri, 23 Dec 2016 10:57:06 +0100
>> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>>
>> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
>> so to avoid DMA to the stack copy them to a field in the heap allocated
>> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
>> including the SG_IO ioctl from userspace will crash the kernel.  Note that
>> this includes running tools like hdparm even when the host does not have
>> SCSI passthrough enabled.
> 
> Ugh. This patch is nasty.
> 
> I think we should just fix blk_execute_rq() instead.
> 
> But from a quick look, we also have at least sg_scsi_ioctl() and
> sg_io() doing the same thing.
> 
> And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c
> 
> And resp_requests() in scsi_debug.c.

It's not that it's technically hard to fix up, it's more that it's a
pain in the ass to have to do it. For instance, for blk_execute_rq(), we
either should enforce that the caller allocates it dynamically and then
free it, or we need nasty hack where the caller needs to know he has to
free it. Pretty obvious what I would prefer there.

And yes, there would be a good chunk of other places where this would
nede to be fixed up...

> So I guess ugly it may need to be, and the rule is that the sense
> buffer really can be on the stack and you can't DMA to/from it.
> Comments from others?

I'm just wondering why this is being hit now, we have a 4.9 release with
this issue and nobody reported it (that I saw)... Which is pretty sad.

If no one beats me to it, I'll try and get a patch done on Sunday. We're
in the midst of the holidays here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24  9:49                               ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-24  9:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Christoph Hellwig, Chris Leech, Ming Lei,
	Dave Chinner, Johannes Weiner, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block,
	Michael S. Tsirkin

On Fri, Dec 23, 2016 at 07:45:45PM -0700, Jens Axboe wrote:
> It's not that it's technically hard to fix up, it's more that it's a
> pain in the ass to have to do it. For instance, for blk_execute_rq(), we
> either should enforce that the caller allocates it dynamically and then
> free it, or we need nasty hack where the caller needs to know he has to
> free it. Pretty obvious what I would prefer there.
> 
> And yes, there would be a good chunk of other places where this would
> nede to be fixed up...

My planned rework for the BLOCK_PC code (split all fields for them out
of struct request and move them into a separate, driver-allocate structure)
would fix this up as a side-effect.  I really wanted to get it into 4.10,
but I didn't manage to fix it up.  I'll try to get it into 4.11 early.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24  9:49                               ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-24  9:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Christoph Hellwig, Chris Leech, Ming Lei,
	Dave Chinner, Johannes Weiner, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List,
	linux-block, Michael S. Tsirkin

On Fri, Dec 23, 2016 at 07:45:45PM -0700, Jens Axboe wrote:
> It's not that it's technically hard to fix up, it's more that it's a
> pain in the ass to have to do it. For instance, for blk_execute_rq(), we
> either should enforce that the caller allocates it dynamically and then
> free it, or we need nasty hack where the caller needs to know he has to
> free it. Pretty obvious what I would prefer there.
> 
> And yes, there would be a good chunk of other places where this would
> nede to be fixed up...

My planned rework for the BLOCK_PC code (split all fields for them out
of struct request and move them into a separate, driver-allocate structure)
would fix this up as a side-effect.  I really wanted to get it into 4.10,
but I didn't manage to fix it up.  I'll try to get it into 4.11 early.

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24 10:07                             ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-24 10:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Chris Leech, Ming Lei, Dave Chinner,
	Johannes Weiner, Linux Kernel Mailing List, Lee Duncan,
	open-iscsi, Linux SCSI List, linux-block, Jens Axboe,
	Michael S. Tsirkin

On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
> Ugh. This patch is nasty.

It's the same SCSI has done for ages - except that is uses a separate
kmalloc for the sense buffer.

> I think we should just fix blk_execute_rq() instead.

As you found out below it's not just blk_execute_rq, it's the whole
architecture of the BLOCK_PC code, which expects a caller provided
sense buffer.  But with the way blk-mq allocates request structures
we can actually fix it, but I first need to extent the way it allows
drivers to allocate private data to the old request code.  I've
actually already implemented that for SCSI long time ago, and have
started to life it to the block layer.

Once that is done the callers won't need a sense buffer at all, and
can just look at the driver provided one.  Which currently is missing
in virtio-blk, so we'd need something similar to the above patch
anyway.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24 10:07                             ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-24 10:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Chris Leech, Ming Lei, Dave Chinner,
	Johannes Weiner, Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List, linux-block,
	Jens Axboe, Michael S. Tsirkin

On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
> Ugh. This patch is nasty.

It's the same SCSI has done for ages - except that is uses a separate
kmalloc for the sense buffer.

> I think we should just fix blk_execute_rq() instead.

As you found out below it's not just blk_execute_rq, it's the whole
architecture of the BLOCK_PC code, which expects a caller provided
sense buffer.  But with the way blk-mq allocates request structures
we can actually fix it, but I first need to extent the way it allows
drivers to allocate private data to the old request code.  I've
actually already implemented that for SCSI long time ago, and have
started to life it to the block layer.

Once that is done the callers won't need a sense buffer at all, and
can just look at the driver provided one.  Which currently is missing
in virtio-blk, so we'd need something similar to the above patch
anyway.

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2016-12-24 10:07                             ` Christoph Hellwig
  (?)
@ 2016-12-24 13:17                               ` Hannes Reinecke
  -1 siblings, 0 replies; 57+ messages in thread
From: Hannes Reinecke @ 2016-12-24 13:17 UTC (permalink / raw)
  To: Christoph Hellwig, Linus Torvalds
  Cc: Chris Leech, Ming Lei, Dave Chinner, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Jens Axboe, Michael S. Tsirkin

On 12/24/2016 11:07 AM, Christoph Hellwig wrote:
> On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
>> Ugh. This patch is nasty.
>
> It's the same SCSI has done for ages - except that is uses a separate
> kmalloc for the sense buffer.
>
>> I think we should just fix blk_execute_rq() instead.
>
> As you found out below it's not just blk_execute_rq, it's the whole
> architecture of the BLOCK_PC code, which expects a caller provided
> sense buffer.  But with the way blk-mq allocates request structures
> we can actually fix it, but I first need to extent the way it allows
> drivers to allocate private data to the old request code.  I've
> actually already implemented that for SCSI long time ago, and have
> started to life it to the block layer.
>
Would be cool to have a generic sense buffer.
I always found it slightly odd, pretending that 'struct request' is 
protocol-agnostic and refusing to add a sense data pointer, but at the 
same time having a field 'sense_len' (which gives the length of what 
exactly?).

Christoph, do you have a pointer to your patchset?
Not that I'll be able to do any meaningful work until next year, but 
having a look would be nice. Just to get a feeling where you want to 
head to; I might be able to work on this start of January.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: J. Hawn, J. Guild, F. Imendï¿½rffer, HRB 16746 (AG Nï¿½rnberg)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24 13:17                               ` Hannes Reinecke
  0 siblings, 0 replies; 57+ messages in thread
From: Hannes Reinecke @ 2016-12-24 13:17 UTC (permalink / raw)
  To: Christoph Hellwig, Linus Torvalds
  Cc: Chris Leech, Ming Lei, Dave Chinner, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Jens Axboe, Michael S. Tsirkin

On 12/24/2016 11:07 AM, Christoph Hellwig wrote:
> On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
>> Ugh. This patch is nasty.
>
> It's the same SCSI has done for ages - except that is uses a separate
> kmalloc for the sense buffer.
>
>> I think we should just fix blk_execute_rq() instead.
>
> As you found out below it's not just blk_execute_rq, it's the whole
> architecture of the BLOCK_PC code, which expects a caller provided
> sense buffer.  But with the way blk-mq allocates request structures
> we can actually fix it, but I first need to extent the way it allows
> drivers to allocate private data to the old request code.  I've
> actually already implemented that for SCSI long time ago, and have
> started to life it to the block layer.
>
Would be cool to have a generic sense buffer.
I always found it slightly odd, pretending that 'struct request' is 
protocol-agnostic and refusing to add a sense data pointer, but at the 
same time having a field 'sense_len' (which gives the length of what 
exactly?).

Christoph, do you have a pointer to your patchset?
Not that I'll be able to do any meaningful work until next year, but 
having a look would be nice. Just to get a feeling where you want to 
head to; I might be able to work on this start of January.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24 13:17                               ` Hannes Reinecke
  0 siblings, 0 replies; 57+ messages in thread
From: Hannes Reinecke @ 2016-12-24 13:17 UTC (permalink / raw)
  To: Christoph Hellwig, Linus Torvalds
  Cc: Chris Leech, Ming Lei, Dave Chinner, Johannes Weiner,
	Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List, linux-block,
	Jens Axboe, Michael S. Tsirkin

On 12/24/2016 11:07 AM, Christoph Hellwig wrote:
> On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
>> Ugh. This patch is nasty.
>
> It's the same SCSI has done for ages - except that is uses a separate
> kmalloc for the sense buffer.
>
>> I think we should just fix blk_execute_rq() instead.
>
> As you found out below it's not just blk_execute_rq, it's the whole
> architecture of the BLOCK_PC code, which expects a caller provided
> sense buffer.  But with the way blk-mq allocates request structures
> we can actually fix it, but I first need to extent the way it allows
> drivers to allocate private data to the old request code.  I've
> actually already implemented that for SCSI long time ago, and have
> started to life it to the block layer.
>
Would be cool to have a generic sense buffer.
I always found it slightly odd, pretending that 'struct request' is 
protocol-agnostic and refusing to add a sense data pointer, but at the 
same time having a field 'sense_len' (which gives the length of what 
exactly?).

Christoph, do you have a pointer to your patchset?
Not that I'll be able to do any meaningful work until next year, but 
having a look would be nice. Just to get a feeling where you want to 
head to; I might be able to work on this start of January.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare-l3A5Bk7waGM@public.gmane.org			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24 13:19                                 ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-24 13:19 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Linus Torvalds, Chris Leech, Ming Lei,
	Dave Chinner, Johannes Weiner, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block, Jens Axboe,
	Michael S. Tsirkin

On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?
> Not that I'll be able to do any meaningful work until next year, but having 
> a look would be nice. Just to get a feeling where you want to head to; I 
> might be able to work on this start of January.

I'll push out a branch once it's revieable and not my current unbisectable
mess, should be soon.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2016-12-24 13:19                                 ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-12-24 13:19 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Linus Torvalds, Chris Leech, Ming Lei,
	Dave Chinner, Johannes Weiner, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List,
	linux-block, Jens Axboe, Michael S. Tsirkin

On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?
> Not that I'll be able to do any meaningful work until next year, but having 
> a look would be nice. Just to get a feeling where you want to head to; I 
> might be able to work on this start of January.

I'll push out a branch once it's revieable and not my current unbisectable
mess, should be soon.

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-02 21:11                       ` Johannes Weiner
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Weiner @ 2017-01-02 21:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig, Jan Kara,
	Andrea Arcangeli

On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > workload again and about a minute in this fired:
> > > > >
> > > > > [628867.607417] ------------[ cut here ]------------
> > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > 
> > > > Well, part of the changes during the merge window were the shadow
> > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > Johannes Weiner to the participants.
> > > > 
> > > > > Now, this workload does not touch the page cache at all - it's
> > > > > entirely an XFS metadata workload, so it should not really be
> > > > > affecting the working set code.
> > > > 
> > > > Well, I suspect that anything that creates memory pressure will end up
> > > > triggering the working set code, so ..
> > > > 
> > > > That said, obviously memory corruption could be involved and result in
> > > > random issues too, but I wouldn't really expect that in this code.
> > > > 
> > > > It would probably be really useful to get more data points - is the
> > > > problem reliably in this area, or is it going to be random and all
> > > > over the place.
> > > 
> > > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > investigate, just set it aside as something to worry about if it happens
> > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> > 
> > This seems consistent with what Dave observed: we encounter regular
> > pages in radix tree nodes on the shadow LRU that should only contain
> > nodes full of exceptional shadow entries. It could be an issue in the
> > new slot replacement code and the node tracking callback.
> 
> Both encounters seem to indicate use-after-free. Dave's node didn't
> warn about an unexpected node->count / node->exceptional state, but
> had entries that were inconsistent with that. Hugh got the counter
> warning but crashed on a list_head that's not NULLed in a live node.
> 
> workingset_update_node() should be called on page cache radix tree
> leaf nodes that go empty. I must be missing an update_node callback
> where a leaf node gets freed somewhere.

Sorry for dropping silent on this. I'm traveling over the holidays
with sporadic access to my emails and no access to real equipment.

The times I managed to sneak away to look at the code didn't turn up
anything useful yet.

Andrea encountered the warning as well and I gave him a debugging
patch (attached below), but he hasn't been able to reproduce this
condition. I've personally never seen the warning trigger, even though
the patches have been running on my main development machine for quite
a while now. Albeit against an older base; I've updated to Linus's
master branch now in case it's an interaction with other new code.

If anybody manages to reproduce this, that would be helpful. Any extra
eyes on this would be much appreciated too until I'm back at my desk.

Thanks

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e07de77..0783af1c0ebb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -640,6 +640,8 @@ static inline void radix_tree_shrink(struct radix_tree_root *root,
 				update_node(node, private);
 		}
 
+		WARN_ON_ONCE(!list_empty(&node->private_list));
+
 		radix_tree_node_free(node);
 	}
 }
@@ -666,6 +668,8 @@ static void delete_node(struct radix_tree_root *root,
 			root->rnode = NULL;
 		}
 
+		WARN_ON_ONCE(!list_empty(&node->private_list));
+
 		radix_tree_node_free(node);
 
 		node = parent;
@@ -767,6 +771,7 @@ static void radix_tree_free_nodes(struct radix_tree_node *node)
 			struct radix_tree_node *old = child;
 			offset = child->offset + 1;
 			child = child->parent;
+			WARN_ON_ONCE(!list_empty(&node->private_list));
 			radix_tree_node_free(old);
 			if (old == entry_to_node(node))
 				return;


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-02 21:11                       ` Johannes Weiner
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Weiner @ 2017-01-02 21:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig, Jan Kara,
	Andrea Arcangeli

On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > workload again and about a minute in this fired:
> > > > >
> > > > > [628867.607417] ------------[ cut here ]------------
> > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > 
> > > > Well, part of the changes during the merge window were the shadow
> > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > Johannes Weiner to the participants.
> > > > 
> > > > > Now, this workload does not touch the page cache at all - it's
> > > > > entirely an XFS metadata workload, so it should not really be
> > > > > affecting the working set code.
> > > > 
> > > > Well, I suspect that anything that creates memory pressure will end up
> > > > triggering the working set code, so ..
> > > > 
> > > > That said, obviously memory corruption could be involved and result in
> > > > random issues too, but I wouldn't really expect that in this code.
> > > > 
> > > > It would probably be really useful to get more data points - is the
> > > > problem reliably in this area, or is it going to be random and all
> > > > over the place.
> > > 
> > > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > investigate, just set it aside as something to worry about if it happens
> > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> > 
> > This seems consistent with what Dave observed: we encounter regular
> > pages in radix tree nodes on the shadow LRU that should only contain
> > nodes full of exceptional shadow entries. It could be an issue in the
> > new slot replacement code and the node tracking callback.
> 
> Both encounters seem to indicate use-after-free. Dave's node didn't
> warn about an unexpected node->count / node->exceptional state, but
> had entries that were inconsistent with that. Hugh got the counter
> warning but crashed on a list_head that's not NULLed in a live node.
> 
> workingset_update_node() should be called on page cache radix tree
> leaf nodes that go empty. I must be missing an update_node callback
> where a leaf node gets freed somewhere.

Sorry for dropping silent on this. I'm traveling over the holidays
with sporadic access to my emails and no access to real equipment.

The times I managed to sneak away to look at the code didn't turn up
anything useful yet.

Andrea encountered the warning as well and I gave him a debugging
patch (attached below), but he hasn't been able to reproduce this
condition. I've personally never seen the warning trigger, even though
the patches have been running on my main development machine for quite
a while now. Albeit against an older base; I've updated to Linus's
master branch now in case it's an interaction with other new code.

If anybody manages to reproduce this, that would be helpful. Any extra
eyes on this would be much appreciated too until I'm back at my desk.

Thanks

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e07de77..0783af1c0ebb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -640,6 +640,8 @@ static inline void radix_tree_shrink(struct radix_tree_root *root,
 				update_node(node, private);
 		}
 
+		WARN_ON_ONCE(!list_empty(&node->private_list));
+
 		radix_tree_node_free(node);
 	}
 }
@@ -666,6 +668,8 @@ static void delete_node(struct radix_tree_root *root,
 			root->rnode = NULL;
 		}
 
+		WARN_ON_ONCE(!list_empty(&node->private_list));
+
 		radix_tree_node_free(node);
 
 		node = parent;
@@ -767,6 +771,7 @@ static void radix_tree_free_nodes(struct radix_tree_node *node)
 			struct radix_tree_node *old = child;
 			offset = child->offset + 1;
 			child = child->parent;
+			WARN_ON_ONCE(!list_empty(&node->private_list));
 			radix_tree_node_free(old);
 			if (old == entry_to_node(node))
 				return;

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2017-01-02 21:11                       ` Johannes Weiner
  (?)
@ 2017-01-03 12:28                       ` Jan Kara
  2017-01-04 15:26                         ` Laurence Oberman
  2017-01-08  2:02                           ` Johannes Weiner
  -1 siblings, 2 replies; 57+ messages in thread
From: Jan Kara @ 2017-01-03 12:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Hugh Dickins, Linus Torvalds, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig, Jan Kara,
	Andrea Arcangeli

On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > workload again and about a minute in this fired:
> > > > > >
> > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > 
> > > > > Well, part of the changes during the merge window were the shadow
> > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > Johannes Weiner to the participants.
> > > > > 
> > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > affecting the working set code.
> > > > > 
> > > > > Well, I suspect that anything that creates memory pressure will end up
> > > > > triggering the working set code, so ..
> > > > > 
> > > > > That said, obviously memory corruption could be involved and result in
> > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > 
> > > > > It would probably be really useful to get more data points - is the
> > > > > problem reliably in this area, or is it going to be random and all
> > > > > over the place.
> > > > 
> > > > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > investigate, just set it aside as something to worry about if it happens
> > > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> > > 
> > > This seems consistent with what Dave observed: we encounter regular
> > > pages in radix tree nodes on the shadow LRU that should only contain
> > > nodes full of exceptional shadow entries. It could be an issue in the
> > > new slot replacement code and the node tracking callback.
> > 
> > Both encounters seem to indicate use-after-free. Dave's node didn't
> > warn about an unexpected node->count / node->exceptional state, but
> > had entries that were inconsistent with that. Hugh got the counter
> > warning but crashed on a list_head that's not NULLed in a live node.
> > 
> > workingset_update_node() should be called on page cache radix tree
> > leaf nodes that go empty. I must be missing an update_node callback
> > where a leaf node gets freed somewhere.
> 
> Sorry for dropping silent on this. I'm traveling over the holidays
> with sporadic access to my emails and no access to real equipment.
> 
> The times I managed to sneak away to look at the code didn't turn up
> anything useful yet.
> 
> Andrea encountered the warning as well and I gave him a debugging
> patch (attached below), but he hasn't been able to reproduce this
> condition. I've personally never seen the warning trigger, even though
> the patches have been running on my main development machine for quite
> a while now. Albeit against an older base; I've updated to Linus's
> master branch now in case it's an interaction with other new code.
> 
> If anybody manages to reproduce this, that would be helpful. Any extra
> eyes on this would be much appreciated too until I'm back at my desk.

I was looking into this but I didn't find a way how we could possibly leave
radix tree node on LRU. So your debug patch looks like a good way forward.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-04 14:07                                 ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2017-01-04 14:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Linus Torvalds, Chris Leech, Ming Lei,
	Dave Chinner, Johannes Weiner, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi, Linux SCSI List, linux-block, Jens Axboe,
	Michael S. Tsirkin

On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?

Here is a pointer to the current one after splitting it into properly
bisectable chunks.  Besides proper changelogs the biggest item left
is fixing up dm-mpath to not allocate its own request structures.

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-04 14:07                                 ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2017-01-04 14:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Linus Torvalds, Chris Leech, Ming Lei,
	Dave Chinner, Johannes Weiner, Linux Kernel Mailing List,
	Lee Duncan, open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List,
	linux-block, Jens Axboe, Michael S. Tsirkin

On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?

Here is a pointer to the current one after splitting it into properly
bisectable chunks.  Besides proper changelogs the biggest item left
is fixing up dm-mpath to not allocate its own request structures.

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2017-01-03 12:28                       ` Jan Kara
@ 2017-01-04 15:26                         ` Laurence Oberman
  2017-01-04 17:38                             ` Laurence Oberman
  2017-01-08  2:02                           ` Johannes Weiner
  1 sibling, 1 reply; 57+ messages in thread
From: Laurence Oberman @ 2017-01-04 15:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Johannes Weiner, Hugh Dickins, Linus Torvalds, Dave Chinner,
	Chris Leech, Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig,
	Andrea Arcangeli



----- Original Message -----
> From: "Jan Kara" <jack@suse.cz>
> To: "Johannes Weiner" <hannes@cmpxchg.org>
> Cc: "Hugh Dickins" <hughd@google.com>, "Linus Torvalds" <torvalds@linux-foundation.org>, "Dave Chinner"
> <david@fromorbit.com>, "Chris Leech" <cleech@redhat.com>, "Linux Kernel Mailing List"
> <linux-kernel@vger.kernel.org>, "Lee Duncan" <lduncan@suse.com>, open-iscsi@googlegroups.com, "Linux SCSI List"
> <linux-scsi@vger.kernel.org>, linux-block@vger.kernel.org, "Christoph Hellwig" <hch@lst.de>, "Jan Kara"
> <jack@suse.cz>, "Andrea Arcangeli" <aarcange@redhat.com>
> Sent: Tuesday, January 3, 2017 7:28:25 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
> 
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com>
> > > > > > wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.
> > > > > > 
> > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > affecting the working set code.
> > > > > > 
> > > > > > Well, I suspect that anything that creates memory pressure will end
> > > > > > up
> > > > > > triggering the working set code, so ..
> > > > > > 
> > > > > > That said, obviously memory corruption could be involved and result
> > > > > > in
> > > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > > 
> > > > > > It would probably be really useful to get more data points - is the
> > > > > > problem reliably in this area, or is it going to be random and all
> > > > > > over the place.
> > > > > 
> > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > shadow_lru_isolate,
> > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > when
> > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > > investigate, just set it aside as something to worry about if it
> > > > > happens
> > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > Dave's
> > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > problems.
> > > > 
> > > > This seems consistent with what Dave observed: we encounter regular
> > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > new slot replacement code and the node tracking callback.
> > > 
> > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > warn about an unexpected node->count / node->exceptional state, but
> > > had entries that were inconsistent with that. Hugh got the counter
> > > warning but crashed on a list_head that's not NULLed in a live node.
> > > 
> > > workingset_update_node() should be called on page cache radix tree
> > > leaf nodes that go empty. I must be missing an update_node callback
> > > where a leaf node gets freed somewhere.
> > 
> > Sorry for dropping silent on this. I'm traveling over the holidays
> > with sporadic access to my emails and no access to real equipment.
> > 
> > The times I managed to sneak away to look at the code didn't turn up
> > anything useful yet.
> > 
> > Andrea encountered the warning as well and I gave him a debugging
> > patch (attached below), but he hasn't been able to reproduce this
> > condition. I've personally never seen the warning trigger, even though
> > the patches have been running on my main development machine for quite
> > a while now. Albeit against an older base; I've updated to Linus's
> > master branch now in case it's an interaction with other new code.
> > 
> > If anybody manages to reproduce this, that would be helpful. Any extra
> > eyes on this would be much appreciated too until I'm back at my desk.
> 
> I was looking into this but I didn't find a way how we could possibly leave
> radix tree node on LRU. So your debug patch looks like a good way forward.
> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Running on 4.10.0-rc2+ and days of ISCSI served XFS load have not reproduced this yet for me with an 4-core (hyper 8) system.
Using ISCSI over Ethernet controller: Intel Corporation Ethernet Connection I217-LM

I will now try with a full offload ISCSI HBA and see if if  have any luck.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-04 17:38                             ` Laurence Oberman
  0 siblings, 0 replies; 57+ messages in thread
From: Laurence Oberman @ 2017-01-04 17:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Johannes Weiner, Hugh Dickins, Linus Torvalds, Dave Chinner,
	Chris Leech, Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig,
	Andrea Arcangeli



----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Jan Kara" <jack@suse.cz>
> Cc: "Johannes Weiner" <hannes@cmpxchg.org>, "Hugh Dickins" <hughd@google.com>, "Linus Torvalds"
> <torvalds@linux-foundation.org>, "Dave Chinner" <david@fromorbit.com>, "Chris Leech" <cleech@redhat.com>, "Linux
> Kernel Mailing List" <linux-kernel@vger.kernel.org>, "Lee Duncan" <lduncan@suse.com>, open-iscsi@googlegroups.com,
> "Linux SCSI List" <linux-scsi@vger.kernel.org>, linux-block@vger.kernel.org, "Christoph Hellwig" <hch@lst.de>,
> "Andrea Arcangeli" <aarcange@redhat.com>
> Sent: Wednesday, January 4, 2017 10:26:09 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
> 
> 
> 
> ----- Original Message -----
> > From: "Jan Kara" <jack@suse.cz>
> > To: "Johannes Weiner" <hannes@cmpxchg.org>
> > Cc: "Hugh Dickins" <hughd@google.com>, "Linus Torvalds"
> > <torvalds@linux-foundation.org>, "Dave Chinner"
> > <david@fromorbit.com>, "Chris Leech" <cleech@redhat.com>, "Linux Kernel
> > Mailing List"
> > <linux-kernel@vger.kernel.org>, "Lee Duncan" <lduncan@suse.com>,
> > open-iscsi@googlegroups.com, "Linux SCSI List"
> > <linux-scsi@vger.kernel.org>, linux-block@vger.kernel.org, "Christoph
> > Hellwig" <hch@lst.de>, "Jan Kara"
> > <jack@suse.cz>, "Andrea Arcangeli" <aarcange@redhat.com>
> > Sent: Tuesday, January 3, 2017 7:28:25 AM
> > Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at
> > iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner
> > > > > > > <david@fromorbit.com>
> > > > > > > wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at
> > > > > > > > mm/workingset.c:461
> > > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> > > > > > > 
> > > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > > affecting the working set code.
> > > > > > > 
> > > > > > > Well, I suspect that anything that creates memory pressure will
> > > > > > > end
> > > > > > > up
> > > > > > > triggering the working set code, so ..
> > > > > > > 
> > > > > > > That said, obviously memory corruption could be involved and
> > > > > > > result
> > > > > > > in
> > > > > > > random issues too, but I wouldn't really expect that in this
> > > > > > > code.
> > > > > > > 
> > > > > > > It would probably be really useful to get more data points - is
> > > > > > > the
> > > > > > > problem reliably in this area, or is it going to be random and
> > > > > > > all
> > > > > > > over the place.
> > > > > > 
> > > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > > shadow_lru_isolate,
> > > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > > when
> > > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time
> > > > > > to
> > > > > > investigate, just set it aside as something to worry about if it
> > > > > > happens
> > > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > > Dave's
> > > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > > problems.
> > > > > 
> > > > > This seems consistent with what Dave observed: we encounter regular
> > > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > > new slot replacement code and the node tracking callback.
> > > > 
> > > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > > warn about an unexpected node->count / node->exceptional state, but
> > > > had entries that were inconsistent with that. Hugh got the counter
> > > > warning but crashed on a list_head that's not NULLed in a live node.
> > > > 
> > > > workingset_update_node() should be called on page cache radix tree
> > > > leaf nodes that go empty. I must be missing an update_node callback
> > > > where a leaf node gets freed somewhere.
> > > 
> > > Sorry for dropping silent on this. I'm traveling over the holidays
> > > with sporadic access to my emails and no access to real equipment.
> > > 
> > > The times I managed to sneak away to look at the code didn't turn up
> > > anything useful yet.
> > > 
> > > Andrea encountered the warning as well and I gave him a debugging
> > > patch (attached below), but he hasn't been able to reproduce this
> > > condition. I've personally never seen the warning trigger, even though
> > > the patches have been running on my main development machine for quite
> > > a while now. Albeit against an older base; I've updated to Linus's
> > > master branch now in case it's an interaction with other new code.
> > > 
> > > If anybody manages to reproduce this, that would be helpful. Any extra
> > > eyes on this would be much appreciated too until I'm back at my desk.
> > 
> > I was looking into this but I didn't find a way how we could possibly leave
> > radix tree node on LRU. So your debug patch looks like a good way forward.
> > 
> > 								Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Running on 4.10.0-rc2+ and days of ISCSI served XFS load have not reproduced
> this yet for me with an 4-core (hyper 8) system.
> Using ISCSI over Ethernet controller: Intel Corporation Ethernet Connection
> I217-LM
> 
> I will now try with a full offload ISCSI HBA and see if if  have any luck.
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I will leave this running now against a full offload connection.
If it triggers will add the debug and reproduce it again and share with the list here.

Kernel 4.10.0-rc2+ on an x86_64

#!/bin/bash
while true
do
	mkfs -t xfs -f /dev/sdav
        mount /dev/sdav /mnt
        cp -Rp /usr /mnt
        umount /mnt
        sleep 10
done


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-04 17:38                             ` Laurence Oberman
  0 siblings, 0 replies; 57+ messages in thread
From: Laurence Oberman @ 2017-01-04 17:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Johannes Weiner, Hugh Dickins, Linus Torvalds, Dave Chinner,
	Chris Leech, Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	Andrea Arcangeli



----- Original Message -----
> From: "Laurence Oberman" <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> To: "Jan Kara" <jack-AlSwsSmVLrQ@public.gmane.org>
> Cc: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>, "Hugh Dickins" <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>, "Linus Torvalds"
> <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>, "Dave Chinner" <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>, "Chris Leech" <cleech-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, "Linux
> Kernel Mailing List" <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "Lee Duncan" <lduncan-IBi9RG/b67k@public.gmane.org>, open-iscsi-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
> "Linux SCSI List" <linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, "Christoph Hellwig" <hch-jcswGhMUV9g@public.gmane.org>,
> "Andrea Arcangeli" <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Sent: Wednesday, January 4, 2017 10:26:09 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
> 
> 
> 
> ----- Original Message -----
> > From: "Jan Kara" <jack-AlSwsSmVLrQ@public.gmane.org>
> > To: "Johannes Weiner" <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> > Cc: "Hugh Dickins" <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>, "Linus Torvalds"
> > <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>, "Dave Chinner"
> > <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>, "Chris Leech" <cleech-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>, "Linux Kernel
> > Mailing List"
> > <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, "Lee Duncan" <lduncan-IBi9RG/b67k@public.gmane.org>,
> > open-iscsi-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org, "Linux SCSI List"
> > <linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, "Christoph
> > Hellwig" <hch-jcswGhMUV9g@public.gmane.org>, "Jan Kara"
> > <jack-AlSwsSmVLrQ@public.gmane.org>, "Andrea Arcangeli" <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Sent: Tuesday, January 3, 2017 7:28:25 AM
> > Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at
> > iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner
> > > > > > > <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
> > > > > > > wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at
> > > > > > > > mm/workingset.c:461
> > > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> > > > > > > 
> > > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > > affecting the working set code.
> > > > > > > 
> > > > > > > Well, I suspect that anything that creates memory pressure will
> > > > > > > end
> > > > > > > up
> > > > > > > triggering the working set code, so ..
> > > > > > > 
> > > > > > > That said, obviously memory corruption could be involved and
> > > > > > > result
> > > > > > > in
> > > > > > > random issues too, but I wouldn't really expect that in this
> > > > > > > code.
> > > > > > > 
> > > > > > > It would probably be really useful to get more data points - is
> > > > > > > the
> > > > > > > problem reliably in this area, or is it going to be random and
> > > > > > > all
> > > > > > > over the place.
> > > > > > 
> > > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > > shadow_lru_isolate,
> > > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > > when
> > > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time
> > > > > > to
> > > > > > investigate, just set it aside as something to worry about if it
> > > > > > happens
> > > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > > Dave's
> > > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > > problems.
> > > > > 
> > > > > This seems consistent with what Dave observed: we encounter regular
> > > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > > new slot replacement code and the node tracking callback.
> > > > 
> > > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > > warn about an unexpected node->count / node->exceptional state, but
> > > > had entries that were inconsistent with that. Hugh got the counter
> > > > warning but crashed on a list_head that's not NULLed in a live node.
> > > > 
> > > > workingset_update_node() should be called on page cache radix tree
> > > > leaf nodes that go empty. I must be missing an update_node callback
> > > > where a leaf node gets freed somewhere.
> > > 
> > > Sorry for dropping silent on this. I'm traveling over the holidays
> > > with sporadic access to my emails and no access to real equipment.
> > > 
> > > The times I managed to sneak away to look at the code didn't turn up
> > > anything useful yet.
> > > 
> > > Andrea encountered the warning as well and I gave him a debugging
> > > patch (attached below), but he hasn't been able to reproduce this
> > > condition. I've personally never seen the warning trigger, even though
> > > the patches have been running on my main development machine for quite
> > > a while now. Albeit against an older base; I've updated to Linus's
> > > master branch now in case it's an interaction with other new code.
> > > 
> > > If anybody manages to reproduce this, that would be helpful. Any extra
> > > eyes on this would be much appreciated too until I'm back at my desk.
> > 
> > I was looking into this but I didn't find a way how we could possibly leave
> > radix tree node on LRU. So your debug patch looks like a good way forward.
> > 
> > 								Honza
> > --
> > Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
> > SUSE Labs, CR
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Running on 4.10.0-rc2+ and days of ISCSI served XFS load have not reproduced
> this yet for me with an 4-core (hyper 8) system.
> Using ISCSI over Ethernet controller: Intel Corporation Ethernet Connection
> I217-LM
> 
> I will now try with a full offload ISCSI HBA and see if if  have any luck.
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I will leave this running now against a full offload connection.
If it triggers will add the debug and reproduce it again and share with the list here.

Kernel 4.10.0-rc2+ on an x86_64

#!/bin/bash
while true
do
	mkfs -t xfs -f /dev/sdav
        mount /dev/sdav /mnt
        cp -Rp /usr /mnt
        umount /mnt
        sleep 10
done

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-08  2:02                           ` Johannes Weiner
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Weiner @ 2017-01-08  2:02 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Hugh Dickins, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig,
	Andrea Arcangeli, Dave Jones, Jan Kara

On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.

Okay, the below patch should address this problem. Dave Jones managed
to reproduce it with the added WARN_ONs, and they made it obvious. He
cannot trigger it anymore with this fix applied. Thanks Dave!

Linus? Andrew?

---

>From 503eeb20e68bdf3529bdc14aca1ce564880129f2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 6 Jan 2017 19:21:43 -0500
Subject: [PATCH] mm: workingset: fix use-after-free in shadow node shrinker

Several people report seeing warnings about inconsistent radix tree
nodes followed by crashes in the workingset code, which all looked
like use-after-free access from the shadow node shrinker. Dave Jones
managed to reproduce the issue with a debug patch applied, which
confirmed that the radix tree shrinking indeed frees shadow nodes
while they are still linked to the shadow LRU:

  WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
  CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
  Call Trace:
   dump_stack+0x4f/0x73
   __warn+0xcb/0xf0
   warn_slowpath_null+0x1d/0x20
   delete_node+0x1e4/0x200
   __radix_tree_delete_node+0xd/0x10
   shadow_lru_isolate+0xe6/0x220
   __list_lru_walk_one.isra.4+0x9b/0x190
   ? memcg_drain_all_list_lrus+0x1d0/0x1d0
   list_lru_walk_one+0x23/0x30
   scan_shadow_nodes+0x2e/0x40
   shrink_slab.part.44+0x23d/0x5d0
   ? 0xffffffffa023a077
   shrink_node+0x22c/0x330
   kswapd+0x392/0x8f0

This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in
the inlined radix_tree_shrink().

The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
tracking to radix tree exceptional tracking"), which passes an update
callback into the radix tree to link and unlink shadow leaf nodes when
tree entries change, but forgot to pass the callback when reclaiming a
shadow node. While the reclaimed shadow node itself is unlinked by the
shrinker, its deletion from the tree can cause the left-most leaf node
in the tree to be shrunk. If that happens to be a shadow node as well,
we don't unlink it from the LRU as we should.

Consider this tree, where the s are shadow entries:

     root->rnode
          |
     [0       n]
      |       |
   [s    ] [sssss]

Now the shadow node shrinker reclaims the rightmost leaf node through
the shadow node LRU:

     root->rnode
          |
     [0        ]
      |
  [s     ]

Because the parent of the deleted node is the first level below the
root and has only one child in the left-most slot, the intermediate
level is shrunk and the node containing the single shadow is put in
its place:

     root->rnode
          |
     [s        ]

The shrinker again sees a single left-most slot in a first level node
and thus decides to store the shadow in root->rnode directly and free
the node - which is a leaf node on the shadow node LRU.

root->rnode
     |
     s

Without the update callback, the freed node remains on the shadow LRU,
where it causes later shrinker runs to crash.

Pass the node updater callback into __radix_tree_delete_node() in case
the deletion causes the left-most branch in the tree to collapse too.

Also add warnings when linked nodes are freed right away, rather than
wait for the use-after-free when the list is scanned much later.

Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/radix-tree.h |  4 +++-
 lib/radix-tree.c           | 11 +++++++++--
 mm/workingset.c            |  3 ++-
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 5dea8f6440e4..52bda854593b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -306,7 +306,9 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 void radix_tree_replace_slot(struct radix_tree_root *root,
 			     void **slot, void *item);
 void __radix_tree_delete_node(struct radix_tree_root *root,
-			      struct radix_tree_node *node);
+			      struct radix_tree_node *node,
+			      radix_tree_update_node_t update_node,
+			      void *private);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 void radix_tree_clear_tags(struct radix_tree_root *root,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e07de77..0b92d605fb69 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -640,6 +640,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root,
 				update_node(node, private);
 		}

+		WARN_ON_ONCE(!list_empty(&node->private_list));
 		radix_tree_node_free(node);
 	}
 }
@@ -666,6 +667,7 @@ static void delete_node(struct radix_tree_root *root,
 			root->rnode = NULL;
 		}

+		WARN_ON_ONCE(!list_empty(&node->private_list));
 		radix_tree_node_free(node);

 		node = parent;
@@ -767,6 +769,7 @@ static void radix_tree_free_nodes(struct radix_tree_node *node)
 			struct radix_tree_node *old = child;
 			offset = child->offset + 1;
 			child = child->parent;
+			WARN_ON_ONCE(!list_empty(&node->private_list));
 			radix_tree_node_free(old);
 			if (old == entry_to_node(node))
 				return;
@@ -1824,15 +1827,19 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
  *	__radix_tree_delete_node    -    try to free node after clearing a slot
  *	@root:		radix tree root
  *	@node:		node containing @index
+ *	@update_node:	callback for changing leaf nodes
+ *	@private:	private data to pass to @update_node
  *
  *	After clearing the slot at @index in @node from radix tree
  *	rooted at @root, call this function to attempt freeing the
  *	node and shrinking the tree.
  */
 void __radix_tree_delete_node(struct radix_tree_root *root,
-			      struct radix_tree_node *node)
+			      struct radix_tree_node *node,
+			      radix_tree_update_node_t update_node,
+			      void *private)
 {
-	delete_node(root, node, NULL, NULL);
+	delete_node(root, node, update_node, private);
 }

 /**
diff --git a/mm/workingset.c b/mm/workingset.c
index 241fa5d6b3b2..abb58ffa3c64 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -473,7 +473,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	if (WARN_ON_ONCE(node->exceptional))
 		goto out_invalid;
 	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+	__radix_tree_delete_node(&mapping->page_tree, node,
+				 workingset_update_node, mapping);

 out_invalid:
 	spin_unlock(&mapping->tree_lock);
-- 
2.10.2

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-08  2:02                           ` Johannes Weiner
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Weiner @ 2017-01-08  2:02 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Hugh Dickins, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	Andrea Arcangeli, Dave Jones, Jan Kara

On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.

Okay, the below patch should address this problem. Dave Jones managed
to reproduce it with the added WARN_ONs, and they made it obvious. He
cannot trigger it anymore with this fix applied. Thanks Dave!

Linus? Andrew?

---

>From 503eeb20e68bdf3529bdc14aca1ce564880129f2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Date: Fri, 6 Jan 2017 19:21:43 -0500
Subject: [PATCH] mm: workingset: fix use-after-free in shadow node shrinker

Several people report seeing warnings about inconsistent radix tree
nodes followed by crashes in the workingset code, which all looked
like use-after-free access from the shadow node shrinker. Dave Jones
managed to reproduce the issue with a debug patch applied, which
confirmed that the radix tree shrinking indeed frees shadow nodes
while they are still linked to the shadow LRU:

  WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
  CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
  Call Trace:
   dump_stack+0x4f/0x73
   __warn+0xcb/0xf0
   warn_slowpath_null+0x1d/0x20
   delete_node+0x1e4/0x200
   __radix_tree_delete_node+0xd/0x10
   shadow_lru_isolate+0xe6/0x220
   __list_lru_walk_one.isra.4+0x9b/0x190
   ? memcg_drain_all_list_lrus+0x1d0/0x1d0
   list_lru_walk_one+0x23/0x30
   scan_shadow_nodes+0x2e/0x40
   shrink_slab.part.44+0x23d/0x5d0
   ? 0xffffffffa023a077
   shrink_node+0x22c/0x330
   kswapd+0x392/0x8f0

This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in
the inlined radix_tree_shrink().

The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
tracking to radix tree exceptional tracking"), which passes an update
callback into the radix tree to link and unlink shadow leaf nodes when
tree entries change, but forgot to pass the callback when reclaiming a
shadow node. While the reclaimed shadow node itself is unlinked by the
shrinker, its deletion from the tree can cause the left-most leaf node
in the tree to be shrunk. If that happens to be a shadow node as well,
we don't unlink it from the LRU as we should.

Consider this tree, where the s are shadow entries:

     root->rnode
          |
     [0       n]
      |       |
   [s    ] [sssss]

Now the shadow node shrinker reclaims the rightmost leaf node through
the shadow node LRU:

     root->rnode
          |
     [0        ]
      |
  [s     ]

Because the parent of the deleted node is the first level below the
root and has only one child in the left-most slot, the intermediate
level is shrunk and the node containing the single shadow is put in
its place:

     root->rnode
          |
     [s        ]

The shrinker again sees a single left-most slot in a first level node
and thus decides to store the shadow in root->rnode directly and free
the node - which is a leaf node on the shadow node LRU.

root->rnode
     |
     s

Without the update callback, the freed node remains on the shadow LRU,
where it causes later shrinker runs to crash.

Pass the node updater callback into __radix_tree_delete_node() in case
the deletion causes the left-most branch in the tree to collapse too.

Also add warnings when linked nodes are freed right away, rather than
wait for the use-after-free when the list is scanned much later.

Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
Reported-by: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Reported-by: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Reported-by: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Reported-by: Dave Jones <davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL@public.gmane.org>
Tested-by: Dave Jones <davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL@public.gmane.org>
Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
---
 include/linux/radix-tree.h |  4 +++-
 lib/radix-tree.c           | 11 +++++++++--
 mm/workingset.c            |  3 ++-
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 5dea8f6440e4..52bda854593b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -306,7 +306,9 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 void radix_tree_replace_slot(struct radix_tree_root *root,
 			     void **slot, void *item);
 void __radix_tree_delete_node(struct radix_tree_root *root,
-			      struct radix_tree_node *node);
+			      struct radix_tree_node *node,
+			      radix_tree_update_node_t update_node,
+			      void *private);
 void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 void radix_tree_clear_tags(struct radix_tree_root *root,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e07de77..0b92d605fb69 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -640,6 +640,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root,
 				update_node(node, private);
 		}
 
+		WARN_ON_ONCE(!list_empty(&node->private_list));
 		radix_tree_node_free(node);
 	}
 }
@@ -666,6 +667,7 @@ static void delete_node(struct radix_tree_root *root,
 			root->rnode = NULL;
 		}
 
+		WARN_ON_ONCE(!list_empty(&node->private_list));
 		radix_tree_node_free(node);
 
 		node = parent;
@@ -767,6 +769,7 @@ static void radix_tree_free_nodes(struct radix_tree_node *node)
 			struct radix_tree_node *old = child;
 			offset = child->offset + 1;
 			child = child->parent;
+			WARN_ON_ONCE(!list_empty(&node->private_list));
 			radix_tree_node_free(old);
 			if (old == entry_to_node(node))
 				return;
@@ -1824,15 +1827,19 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
  *	__radix_tree_delete_node    -    try to free node after clearing a slot
  *	@root:		radix tree root
  *	@node:		node containing @index
+ *	@update_node:	callback for changing leaf nodes
+ *	@private:	private data to pass to @update_node
  *
  *	After clearing the slot at @index in @node from radix tree
  *	rooted at @root, call this function to attempt freeing the
  *	node and shrinking the tree.
  */
 void __radix_tree_delete_node(struct radix_tree_root *root,
-			      struct radix_tree_node *node)
+			      struct radix_tree_node *node,
+			      radix_tree_update_node_t update_node,
+			      void *private)
 {
-	delete_node(root, node, NULL, NULL);
+	delete_node(root, node, update_node, private);
 }
 
 /**
diff --git a/mm/workingset.c b/mm/workingset.c
index 241fa5d6b3b2..abb58ffa3c64 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -473,7 +473,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 	if (WARN_ON_ONCE(node->exceptional))
 		goto out_invalid;
 	inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+	__radix_tree_delete_node(&mapping->page_tree, node,
+				 workingset_update_node, mapping);
 
 out_invalid:
 	spin_unlock(&mapping->tree_lock);
-- 
2.10.2

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-08  2:17                             ` Linus Torvalds
  0 siblings, 0 replies; 57+ messages in thread
From: Linus Torvalds @ 2017-01-08  2:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig,
	Andrea Arcangeli, Dave Jones, Jan Kara

On Sat, Jan 7, 2017 at 6:02 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Linus? Andrew?

Looks fine to me. Will apply.

               Linus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-08  2:17                             ` Linus Torvalds
  0 siblings, 0 replies; 57+ messages in thread
From: Linus Torvalds @ 2017-01-08  2:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Hugh Dickins, Dave Chinner, Chris Leech,
	Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List, linux-block,
	Christoph Hellwig, Andrea Arcangeli, Dave Jones, Jan Kara

On Sat, Jan 7, 2017 at 6:02 PM, Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
>
> Linus? Andrew?

Looks fine to me. Will apply.

               Linus

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-09 20:30                             ` Jan Kara
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2017-01-09 20:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins, Dave Chinner,
	Chris Leech, Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig,
	Andrea Arcangeli, Dave Jones, Jan Kara

On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> 
> Okay, the below patch should address this problem. Dave Jones managed
> to reproduce it with the added WARN_ONs, and they made it obvious. He
> cannot trigger it anymore with this fix applied. Thanks Dave!

FWIW the patch looks good to me. I'd just note that the need to pass the
callback to deletion function and the fact that we do it only in cases
where we think it is needed appears errorprone. With the warning you've
added it should at least catch the cases where we got it wrong but more
robust would be if the radix tree root contained a pointer to the callback
function so that we would not rely on passing the callback to every place
which can possibly free a node. Also conceptually this would make more
sense to me since the fact that we may need to do some cleanup on node
deletion is a property of the particular radix tree and how we use it.
OTOH that would mean growing radix tree root by one pointer which would be
unpopular I guess.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
@ 2017-01-09 20:30                             ` Jan Kara
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Kara @ 2017-01-09 20:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins, Dave Chinner,
	Chris Leech, Linux Kernel Mailing List, Lee Duncan,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw, Linux SCSI List,
	linux-block-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	Andrea Arcangeli, Dave Jones, Jan Kara

On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> 
> Okay, the below patch should address this problem. Dave Jones managed
> to reproduce it with the added WARN_ONs, and they made it obvious. He
> cannot trigger it anymore with this fix applied. Thanks Dave!

FWIW the patch looks good to me. I'd just note that the need to pass the
callback to deletion function and the fact that we do it only in cases
where we think it is needed appears errorprone. With the warning you've
added it should at least catch the cases where we got it wrong but more
robust would be if the radix tree root contained a pointer to the callback
function so that we would not rely on passing the callback to every place
which can possibly free a node. Also conceptually this would make more
sense to me since the fact that we may need to do some cleanup on node
deletion is a property of the particular radix tree and how we use it.
OTOH that would mean growing radix tree root by one pointer which would be
unpopular I guess.

								Honza

-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
  2017-01-09 20:30                             ` Jan Kara
  (?)
@ 2017-01-09 20:45                             ` Johannes Weiner
  -1 siblings, 0 replies; 57+ messages in thread
From: Johannes Weiner @ 2017-01-09 20:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linus Torvalds, Andrew Morton, Hugh Dickins, Dave Chinner,
	Chris Leech, Linux Kernel Mailing List, Lee Duncan, open-iscsi,
	Linux SCSI List, linux-block, Christoph Hellwig,
	Andrea Arcangeli, Dave Jones

On Mon, Jan 09, 2017 at 09:30:05PM +0100, Jan Kara wrote:
> On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> > On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > > workload again and about a minute in this fired:
> > > > > > > > >
> > > > > > > > > [628867.607417] ------------[ cut here ]------------
> > > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > > 
> > > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > > Johannes Weiner to the participants.
> > 
> > Okay, the below patch should address this problem. Dave Jones managed
> > to reproduce it with the added WARN_ONs, and they made it obvious. He
> > cannot trigger it anymore with this fix applied. Thanks Dave!
> 
> FWIW the patch looks good to me. I'd just note that the need to pass the
> callback to deletion function and the fact that we do it only in cases
> where we think it is needed appears errorprone. With the warning you've
> added it should at least catch the cases where we got it wrong but more
> robust would be if the radix tree root contained a pointer to the callback
> function so that we would not rely on passing the callback to every place
> which can possibly free a node. Also conceptually this would make more
> sense to me since the fact that we may need to do some cleanup on node
> deletion is a property of the particular radix tree and how we use it.
> OTOH that would mean growing radix tree root by one pointer which would be
> unpopular I guess.

The last sentence is the crux, unfortunately. The first iteration of
the shadow shrinker linked up mappings that contained shadow entries,
rather than nodes. The code would have been drastically simpler in
pretty much all regards, without changes to the radix tree API. But
Dave Chinner objected to adding a pointer to the inode when we could
stick them into empty space (slab fragmentation) inside the nodes. A
fair point, I guess, especially when you consider metadata-intense
workloads. But now we're kind of stuck with the complexity of this.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2017-01-09 20:45 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-14 22:24 [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0 Dave Chinner
2016-12-14 22:29 ` Dave Chinner
2016-12-16 18:59   ` Chris Leech
2016-12-21 22:16     ` Dave Chinner
2016-12-21 23:19       ` Linus Torvalds
2016-12-22  0:13         ` Chris Leech
2016-12-22  5:13           ` Dave Chinner
2016-12-22  5:46             ` Linus Torvalds
2016-12-22  6:50               ` Dave Chinner
2016-12-22 18:50                 ` Chris Leech
2016-12-22 23:53                   ` Ming Lei
2016-12-23  0:03                     ` Chris Leech
2016-12-23  0:03                       ` Chris Leech
2016-12-23 10:00                       ` Christoph Hellwig
2016-12-23 19:42                         ` Linus Torvalds
2016-12-24  2:45                           ` Jens Axboe
2016-12-24  2:45                             ` Jens Axboe
2016-12-24  9:49                             ` Christoph Hellwig
2016-12-24  9:49                               ` Christoph Hellwig
2016-12-24 10:07                           ` Christoph Hellwig
2016-12-24 10:07                             ` Christoph Hellwig
2016-12-24 13:17                             ` Hannes Reinecke
2016-12-24 13:17                               ` Hannes Reinecke
2016-12-24 13:17                               ` Hannes Reinecke
2016-12-24 13:19                               ` Christoph Hellwig
2016-12-24 13:19                                 ` Christoph Hellwig
2017-01-04 14:07                               ` Christoph Hellwig
2017-01-04 14:07                                 ` Christoph Hellwig
2016-12-22 20:22               ` Hugh Dickins
2016-12-23  7:32                 ` Johannes Weiner
2016-12-23  8:33                   ` Johannes Weiner
2017-01-02 21:11                     ` Johannes Weiner
2017-01-02 21:11                       ` Johannes Weiner
2017-01-03 12:28                       ` Jan Kara
2017-01-04 15:26                         ` Laurence Oberman
2017-01-04 17:38                           ` Laurence Oberman
2017-01-04 17:38                             ` Laurence Oberman
2017-01-08  2:02                         ` Johannes Weiner
2017-01-08  2:02                           ` Johannes Weiner
2017-01-08  2:17                           ` Linus Torvalds
2017-01-08  2:17                             ` Linus Torvalds
2017-01-09 20:30                           ` Jan Kara
2017-01-09 20:30                             ` Jan Kara
2017-01-09 20:45                             ` Johannes Weiner
2016-12-22  6:28             ` Dave Chinner
2016-12-22 17:24               ` Linus Torvalds
2016-12-22 20:20                 ` Thomas Gleixner
2016-12-22 20:42                 ` Dave Chinner
2016-12-22 21:06                   ` Dave Chinner
2016-12-22 21:10                     ` Linus Torvalds
2016-12-22 22:15                       ` Dave Chinner
2016-12-22 22:33                         ` Dave Chinner
2016-12-23  3:52                           ` Dave Chinner
2016-12-23  0:16                       ` Jens Axboe
2016-12-22  6:18         ` Christoph Hellwig
2016-12-22  6:30           ` Dave Chinner
2016-12-22  6:36             ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.