linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
       [not found] <a2d46622-a957-dffe-04d1-8087bbf0f8b5@linux.vnet.ibm.com>
@ 2018-12-11 13:29 ` Michael Ellerman
  2018-12-11 15:26   ` Michael Bringmann
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Michael Ellerman @ 2018-12-11 13:29 UTC
  To: Michael Bringmann, linuxppc-dev
  Cc: Michael Bringmann, Tyrel Datwyler, Thomas Falcon, Juliet Kim,
	robh+dt, frowand.list, devicetree, linux-kernel

Hi Michael,

Please Cc the device tree folks on device tree patches, and also the
original author of the patch that added the code you're modifying.

So I've added:
  robh+dt@kernel.org
  frowand.list@gmail.com
  devicetree@vger.kernel.org
  linux-kernel@vger.kernel.org

Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> The PPC mobility code receives RTAS requests to delete nodes with
> platform-/hardware-specific attributes when restarting the kernel
> after a migration.  My example is for migration between a P8 Alpine
> and a P8 Brazos.  Nodes to be deleted include 'ibm,random-v1',
> 'ibm,platform-facilities', 'ibm,sym-encryption-v1', and
> 'ibm,compression-v1'.
>
> The mobility.c code calls 'of_detach_node' for the nodes and their
> children.  This makes calls to detach the properties and to remove
> the associated sysfs/kernfs files.
>
> New copies of the same nodes are then provided by the PHYP,
> local copies are built, and a pointer to the 'struct device_node'
> is passed to of_attach_node.  Before the call to of_attach_node,
> the phandle is initialized to 0 when the data structure is allocated.
> During the call, of_attach_node calls __of_attach_node, which
> pulls the actual name and phandle from the just-created sub-properties
> named something like 'name' and 'ibm,phandle'.
>
> This is all fine for the first migration.  The problem occurs with
> the second and subsequent migrations when the PHYP on the new system
> wants to replace the same set of nodes again, referenced with the
> same names and phandle values.
>
> On the second and subsequent migrations, the PHYP tells the system
> to again delete the nodes 'ibm,platform-facilities', 'ibm,random-v1',
> 'ibm,compression-v1', 'ibm,sym-encryption-v1'.  It specifies these
> nodes by its known set of phandle values -- the same handles used
> by the PHYP on the source system are known on the target system.
> The mobility.c code calls of_find_node_by_phandle() with these values
> and ends up locating the first instance of each node that was added
> during the original boot, instead of the second instance of each node
> created after the first migration.  The detach during the second
> migration fails with errors like:
>
> [ 4565.030704] WARNING: CPU: 3 PID: 4787 at drivers/of/dynamic.c:252 __of_detach_node+0x8/0xa0
> [ 4565.030708] Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag lockd grace fscache sunrpc xts vmx_crypto sg pseries_rng binfmt_misc ip_tables xfs libcrc32c sd_mod ibmveth ibmvscsi scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
> [ 4565.030733] CPU: 3 PID: 4787 Comm: drmgr Tainted: G        W         4.18.0-rc1-wi107836-v05-120+ #201
> [ 4565.030737] NIP:  c0000000007c1ea8 LR: c0000000007c1fb4 CTR: 0000000000655170
> [ 4565.030741] REGS: c0000003f302b690 TRAP: 0700   Tainted: G        W          (4.18.0-rc1-wi107836-v05-120+)
> [ 4565.030745] MSR:  800000010282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 22288822  XER: 0000000a
> [ 4565.030757] CFAR: c0000000007c1fb0 IRQMASK: 1
> [ 4565.030757] GPR00: c0000000007c1fa4 c0000003f302b910 c00000000114bf00 c0000003ffff8e68
> [ 4565.030757] GPR04: 0000000000000001 ffffffffffffffff 800000c008e0b4b8 ffffffffffffffff
> [ 4565.030757] GPR08: 0000000000000000 0000000000000001 0000000080000003 0000000000002843
> [ 4565.030757] GPR12: 0000000000008800 c00000001ec9ae00 0000000040000000 0000000000000000
> [ 4565.030757] GPR16: 0000000000000000 0000000000000008 0000000000000000 00000000f6ffffff
> [ 4565.030757] GPR20: 0000000000000007 0000000000000000 c0000003e9f1f034 0000000000000001
> [ 4565.030757] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 4565.030757] GPR28: c000000001549d28 c000000001134828 c0000003ffff8e68 c0000003f302b930
> [ 4565.030804] NIP [c0000000007c1ea8] __of_detach_node+0x8/0xa0
> [ 4565.030808] LR [c0000000007c1fb4] of_detach_node+0x74/0xd0
> [ 4565.030811] Call Trace:
> [ 4565.030815] [c0000003f302b910] [c0000000007c1fa4] of_detach_node+0x64/0xd0 (unreliable)
> [ 4565.030821] [c0000003f302b980] [c0000000000c33c4] dlpar_detach_node+0xb4/0x150
> [ 4565.030826] [c0000003f302ba10] [c0000000000c3ffc] delete_dt_node+0x3c/0x80
> [ 4565.030831] [c0000003f302ba40] [c0000000000c4380] pseries_devicetree_update+0x150/0x4f0
> [ 4565.030836] [c0000003f302bb70] [c0000000000c479c] post_mobility_fixup+0x7c/0xf0
> [ 4565.030841] [c0000003f302bbe0] [c0000000000c4908] migration_store+0xf8/0x130
> [ 4565.030847] [c0000003f302bc70] [c000000000998160] kobj_attr_store+0x30/0x60
> [ 4565.030852] [c0000003f302bc90] [c000000000412f14] sysfs_kf_write+0x64/0xa0
> [ 4565.030857] [c0000003f302bcb0] [c000000000411cac] kernfs_fop_write+0x16c/0x240
> [ 4565.030862] [c0000003f302bd00] [c000000000355f20] __vfs_write+0x40/0x220
> [ 4565.030867] [c0000003f302bd90] [c000000000356358] vfs_write+0xc8/0x240
> [ 4565.030872] [c0000003f302bde0] [c0000000003566cc] ksys_write+0x5c/0x100
> [ 4565.030880] [c0000003f302be30] [c00000000000b288] system_call+0x5c/0x70
> [ 4565.030884] Instruction dump:
> [ 4565.030887] 38210070 38600000 e8010010 eb61ffd8 eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8
> [ 4565.030895] 7c0803a6 4e800020 e9230098 7929f7e2 <0b090000> 2f890000 4cde0020 e9030040
> [ 4565.030903] ---[ end trace 5bd54cb1df9d2976 ]---
>
> The mobility.c code continues during the second migration, accepts
> the definitions of the new nodes from the PHYP, and ends up renaming
> the new properties, e.g.:
>
> [ 4565.827296] Duplicate name in base, renamed to "ibm,platform-facilities#1"
>
> There is no check like 'of_node_check_flag(np, OF_DETACHED)' within
> of_find_node_by_phandle to skip nodes that are detached but still
> present due to caching or use-count considerations.  Also, note that
> of_find_node_by_phandle uses a 'phandle_cache', which does not
> appear to be updated when of_detach_node() is invoked.

This seems like the real bug. Since the phandle cache was added we can
now find detached nodes when we shouldn't be able to.

Does the patch below work?

cheers

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 09692c9b32a7..d8e4534c0686 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
 		if (phandle_cache[masked_handle] &&
 		    handle == phandle_cache[masked_handle]->phandle)
 			np = phandle_cache[masked_handle];
+
+		/* If we find a detached node, remove it */
+		if (of_node_check_flag(np, OF_DETACHED))
+			np = phandle_cache[masked_handle] = NULL;
 	}
 
 	if (!np) {
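
The stale-cache behaviour described above can be reproduced in a small
userspace model. This is only a sketch under stated assumptions, not the
kernel code: 'struct node', CACHE_SIZE, attach(), detach() and
find_by_phandle() are hypothetical stand-ins for struct device_node, the
phandle cache, of_attach_node(), of_detach_node() and
of_find_node_by_phandle(). In this model the detached check sits inside
the match branch, so the flag is only read through a non-NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct device_node; not the kernel definition. */
struct node {
	uint32_t phandle;
	bool detached;		/* models the OF_DETACHED flag */
	struct node *next;	/* models the set of attached nodes */
};

#define CACHE_SIZE 4u			/* hypothetical size, power of two */
#define CACHE_MASK (CACHE_SIZE - 1u)

static struct node *live_nodes;		/* attached nodes only */
static struct node *cache[CACHE_SIZE];	/* direct-mapped phandle cache */

void attach(struct node *n)
{
	assert(n != NULL);
	n->detached = false;
	n->next = live_nodes;
	live_nodes = n;
}

/* Unlinks the node and flags it, but leaves the cache untouched. */
void detach(struct node *n)
{
	struct node **pp;

	assert(n != NULL);
	for (pp = &live_nodes; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			break;
		}
	}
	n->next = NULL;
	n->detached = true;
}

struct node *find_by_phandle(uint32_t handle, bool check_detached)
{
	uint32_t slot = handle & CACHE_MASK;
	struct node *np = NULL;

	if (cache[slot] && cache[slot]->phandle == handle) {
		np = cache[slot];
		/* Drop a stale entry so the walk below can replace it. */
		if (check_detached && np->detached)
			np = cache[slot] = NULL;
	}

	if (!np) {
		/* Cache miss: walk the attached nodes and refill the slot. */
		for (np = live_nodes; np; np = np->next) {
			if (np->phandle == handle) {
				cache[slot] = np;
				break;
			}
		}
	}
	return np;
}
```

With check_detached set to false, a lookup after detach() followed by
attach() of a replacement node with the same phandle still returns the
old node, which mirrors the failure on the second migration. With the
check enabled, the stale slot is cleared and the walk finds the
replacement.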


* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
  2018-12-11 13:29 ` [PATCH v03] powerpc/mobility: Fix node detach/rename problem Michael Ellerman
@ 2018-12-11 15:26   ` Michael Bringmann
  2018-12-11 16:07   ` Rob Herring
  2018-12-11 16:43   ` Michael Bringmann
  2 siblings, 0 replies; 8+ messages in thread
From: Michael Bringmann @ 2018-12-11 15:26 UTC
  To: Michael Ellerman, linuxppc-dev
  Cc: devicetree, Thomas Falcon, linux-kernel, robh+dt, Juliet Kim,
	Tyrel Datwyler, frowand.list



On 12/11/2018 07:29 AM, Michael Ellerman wrote:
> Hi Michael,
> 
> Please Cc the device tree folks on device tree patches, and also the
> original author of the patch that added the code you're modifying.
> 
> So I've added:
>   robh+dt@kernel.org
>   frowand.list@gmail.com
>   devicetree@vger.kernel.org
>   linux-kernel@vger.kernel.org

Thanks.

> 
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>> --- Snip ---
> 
> This seems like the real bug. Since the phandle cache was added we can
> now find detached nodes when we shouldn't be able to.
> 
> Does the patch below work?
> 
> cheers
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 09692c9b32a7..d8e4534c0686 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>  		if (phandle_cache[masked_handle] &&
>  		    handle == phandle_cache[masked_handle]->phandle)
>  			np = phandle_cache[masked_handle];
> +
> +		/* If we find a detached node, remove it */
> +		if (of_node_check_flag(np, OF_DETACHED))
> +			np = phandle_cache[masked_handle] = NULL;
>  	}
> 
>  	if (!np) {
> 
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com



* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
  2018-12-11 13:29 ` [PATCH v03] powerpc/mobility: Fix node detach/rename problem Michael Ellerman
  2018-12-11 15:26   ` Michael Bringmann
@ 2018-12-11 16:07   ` Rob Herring
  2018-12-12 22:00     ` Frank Rowand
  2018-12-13  2:56     ` Michael Ellerman
  2018-12-11 16:43   ` Michael Bringmann
  2 siblings, 2 replies; 8+ messages in thread
From: Rob Herring @ 2018-12-11 16:07 UTC
  To: Michael Ellerman
  Cc: mwb, linuxppc-dev, Tyrel Datwyler, tlfalcon, minkim,
	Frank Rowand, devicetree, linux-kernel

On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Hi Michael,
>
> Please Cc the device tree folks on device tree patches, and also the
> original author of the patch that added the code you're modifying.
>
> So I've added:
>   robh+dt@kernel.org
>   frowand.list@gmail.com
>   devicetree@vger.kernel.org
>   linux-kernel@vger.kernel.org
>
> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
> > --- Snip ---
>
> This seems like the real bug. Since the phandle cache was added we can
> now find detached nodes when we shouldn't be able to.
>
> Does the patch below work?
>
> cheers
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 09692c9b32a7..d8e4534c0686 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>                 if (phandle_cache[masked_handle] &&
>                     handle == phandle_cache[masked_handle]->phandle)
>                         np = phandle_cache[masked_handle];
> +
> +               /* If we find a detached node, remove it */
> +               if (of_node_check_flag(np, OF_DETACHED))
> +                       np = phandle_cache[masked_handle] = NULL;

I'm wondering if we should explicitly remove the node from the cache
when we set OF_DETACHED. Otherwise, it could be possible that the node
pointer has been freed already. Or maybe we need both?

Rob
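
The alternative raised above, dropping the cache entry at the moment the
node is flagged as detached, can be sketched in the same userspace
style. This is a model under assumptions, not the kernel
implementation: 'struct node', cache_store(), cache_probe() and
detach_and_invalidate() are hypothetical names. The point it
illustrates is that once detach clears the slot, no later lookup can
reach a pointer whose memory may already have been freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct device_node; not the kernel definition. */
struct node {
	uint32_t phandle;
	bool detached;		/* models the OF_DETACHED flag */
};

#define CACHE_SIZE 4u			/* hypothetical size, power of two */
#define CACHE_MASK (CACHE_SIZE - 1u)

static struct node *cache[CACHE_SIZE];	/* direct-mapped phandle cache */

/* Record a node in the slot selected by its masked phandle. */
void cache_store(struct node *n)
{
	assert(n != NULL);
	cache[n->phandle & CACHE_MASK] = n;
}

/* Return the cached node for this handle, or NULL on a miss. */
struct node *cache_probe(uint32_t handle)
{
	struct node *n = cache[handle & CACHE_MASK];

	return (n && n->phandle == handle) ? n : NULL;
}

/* Flag the node as detached and remove it from the cache in one step. */
void detach_and_invalidate(struct node *n)
{
	uint32_t slot;

	assert(n != NULL);
	slot = n->phandle & CACHE_MASK;
	n->detached = true;
	/* Only clear our own entry; the slot may hold a different node. */
	if (cache[slot] == n)
		cache[slot] = NULL;
}
```

Comparing the slot against the node pointer before clearing matters
because two phandles can share a slot, so an unconditional clear would
evict an unrelated live entry.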


* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
  2018-12-11 13:29 ` [PATCH v03] powerpc/mobility: Fix node detach/rename problem Michael Ellerman
  2018-12-11 15:26   ` Michael Bringmann
  2018-12-11 16:07   ` Rob Herring
@ 2018-12-11 16:43   ` Michael Bringmann
  2 siblings, 0 replies; 8+ messages in thread
From: Michael Bringmann @ 2018-12-11 16:43 UTC
  To: Michael Ellerman, linuxppc-dev
  Cc: devicetree, Thomas Falcon, linux-kernel, robh+dt, Juliet Kim,
	Tyrel Datwyler, frowand.list

--- Snip ---

>>
>> The mobility.c code continues during the second migration, accepts
>> the definitions of the new nodes from the PHYP, and ends up renaming
>> the new properties, e.g.:
>>
>> [ 4565.827296] Duplicate name in base, renamed to "ibm,platform-facilities#1"
>>
>> There is no check like 'of_node_check_flag(np, OF_DETACHED)' within
>> of_find_node_by_phandle to skip nodes that are detached but still
>> present due to caching or use-count considerations.  Also, note that
>> of_find_node_by_phandle uses a 'phandle_cache', which does not
>> appear to be updated when of_detach_node() is invoked.
> 
> This seems like the real bug. Since the phandle cache was added we can
> now find detached nodes when we shouldn't be able to.
> 
> Does the patch below work?
> 
> cheers
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 09692c9b32a7..d8e4534c0686 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>  		if (phandle_cache[masked_handle] &&
>  		    handle == phandle_cache[masked_handle]->phandle)
>  			np = phandle_cache[masked_handle];
> +
> +		/* If we find a detached node, remove it */
> +		if (of_node_check_flag(np, OF_DETACHED))
> +			np = phandle_cache[masked_handle] = NULL;
>  	}
> 
>  	if (!np) {
> 
> 

I think this would be a bit better for cases where masked values overlap:

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 09692c9..ec79129 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -1188,8 +1188,13 @@ struct device_node *of_find_node_by_phandle(phandle handle)
 
 	if (phandle_cache) {
 		if (phandle_cache[masked_handle] &&
-		    handle == phandle_cache[masked_handle]->phandle)
-			np = phandle_cache[masked_handle];
+		    handle == phandle_cache[masked_handle]->phandle) {
+			np = phandle_cache[masked_handle];
+
+			/* If we find a detached node, remove it */
+			if (of_node_check_flag(np, OF_DETACHED))
+				np = phandle_cache[masked_handle] = NULL;
+		}
 	}
 
 	if (!np) {


Will try it out.  Wouldn't it be better to do this when the node is detached
in drivers/of/dynamic.c:__of_detach_node()?

Thanks.
Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com
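
The "masked values overlap" case mentioned above can be illustrated with
a short userspace sketch. It assumes a direct-mapped cache indexed by
'handle & mask'; 'struct node', cache_fill(), slot_in_use() and
lookup_cached() are hypothetical names, not kernel API. Two different
phandles can select the same slot, so a probe for the non-resident one
leaves np NULL, and the detached flag should only be read after the
phandle comparison has matched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct device_node; not the kernel definition. */
struct node {
	uint32_t phandle;
	bool detached;		/* models the OF_DETACHED flag */
};

#define CACHE_SIZE 4u			/* hypothetical size, power of two */
#define CACHE_MASK (CACHE_SIZE - 1u)

static struct node *cache[CACHE_SIZE];	/* direct-mapped phandle cache */

void cache_fill(struct node *n)
{
	assert(n != NULL);
	cache[n->phandle & CACHE_MASK] = n;
}

/* Reports whether the slot selected by this handle holds any node. */
bool slot_in_use(uint32_t handle)
{
	return cache[handle & CACHE_MASK] != NULL;
}

struct node *lookup_cached(uint32_t handle)
{
	uint32_t slot = handle & CACHE_MASK;
	struct node *np = NULL;

	if (cache[slot] && cache[slot]->phandle == handle) {
		np = cache[slot];
		/* np is known to be non-NULL here, so the flag read is safe. */
		if (np->detached)
			np = cache[slot] = NULL;
	}
	/* On a slot collision np stays NULL and the resident entry is kept. */
	return np;
}
```

Here phandles 0x2 and 0x6 both map to slot 2. Looking up 0x6 while 0x2
is cached returns NULL without touching the resident entry, and only a
matching lookup of a detached node clears the slot.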



* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
  2018-12-11 16:07   ` Rob Herring
@ 2018-12-12 22:00     ` Frank Rowand
  2018-12-13  2:57       ` Michael Ellerman
  2018-12-13  2:56     ` Michael Ellerman
  1 sibling, 1 reply; 8+ messages in thread
From: Frank Rowand @ 2018-12-12 22:00 UTC
  To: Rob Herring, Michael Ellerman
  Cc: mwb, linuxppc-dev, Tyrel Datwyler, tlfalcon, minkim, devicetree,
	linux-kernel

Hi Michael Bringmann,

On 12/11/18 8:07 AM, Rob Herring wrote:
> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> Hi Michael,
>>
>> Please Cc the device tree folks on device tree patches, and also the
>> original author of the patch that added the code you're modifying.
>>
>> So I've added:
>>   robh+dt@kernel.org
>>   frowand.list@gmail.com
>>   devicetree@vger.kernel.org
>>   linux-kernel@vger.kernel.org
>>
>> Michael Bringmann <mwb@linux.vnet.ibm.com> writes:
>>> --- Snip ---
>>
>> This seems like the real bug. Since the phandle cache was added we can
>> now find detached nodes when we shouldn't be able to.
>>
>> Does the patch below work?
>>
>> cheers
>>
>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>> index 09692c9b32a7..d8e4534c0686 100644
>> --- a/drivers/of/base.c
>> +++ b/drivers/of/base.c
>> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>>                 if (phandle_cache[masked_handle] &&
>>                     handle == phandle_cache[masked_handle]->phandle)
>>                         np = phandle_cache[masked_handle];
>> +
>> +               /* If we find a detached node, remove it */
>> +               if (of_node_check_flag(np, OF_DETACHED))
>> +                       np = phandle_cache[masked_handle] = NULL;

The bug you found exposes a couple of different issues, a little bit
deeper than the proposed fix.  I'll work on a fuller fix tonight or
tomorrow.


> I'm wondering if we should explicitly remove the node from the cache
> when we set OF_DETACHED. Otherwise, it could be possible that the node
> pointer has been freed already. Or maybe we need both?

Yes, it should be explicitly removed.  I may also add in a paranoia check in
of_find_node_by_phandle().
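
As a rough illustration of the explicit removal, here is a minimal userspace
sketch, not kernel code.  Every name in it (model_node, model_cache,
model_attach, model_detach, model_uncache, model_find_by_phandle) is
hypothetical, and the direct-mapped, power-of-two cache layout is only
assumed to resemble the masked_handle indexing in the quoted diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_ENTRIES 8u                /* assumed: must be a power of two */
#define CACHE_MASK (CACHE_ENTRIES - 1u)

/* Hypothetical stand-in for struct device_node. */
struct model_node {
	uint32_t phandle;
	int detached;                   /* models the OF_DETACHED flag */
	struct model_node *next;        /* models the list of attached nodes */
};

static struct model_node *attached_nodes;
static struct model_node *model_cache[CACHE_ENTRIES];

static void model_attach(struct model_node *np)
{
	assert(np != NULL);
	np->detached = 0;
	np->next = attached_nodes;
	attached_nodes = np;
}

/* Drop the cache slot only if it still points at this node. */
static void model_uncache(struct model_node *np)
{
	uint32_t slot = np->phandle & CACHE_MASK;

	if (model_cache[slot] == np)
		model_cache[slot] = NULL;
}

static void model_detach(struct model_node *np)
{
	struct model_node **pp;

	assert(np != NULL);
	/* Unlink the node from the list of attached nodes. */
	for (pp = &attached_nodes; *pp; pp = &(*pp)->next) {
		if (*pp == np) {
			*pp = np->next;
			break;
		}
	}
	np->next = NULL;
	np->detached = 1;
	model_uncache(np);              /* the explicit removal */
}

static struct model_node *model_find_by_phandle(uint32_t handle)
{
	uint32_t slot = handle & CACHE_MASK;
	struct model_node *np;

	if (!handle)
		return NULL;

	/* Fast path: the slot holds a node with exactly this phandle. */
	np = model_cache[slot];
	if (np && np->phandle == handle)
		return np;

	/* Slow path: walk the attached nodes and refill the slot. */
	for (np = attached_nodes; np; np = np->next) {
		if (np->phandle == handle) {
			model_cache[slot] = np;
			return np;
		}
	}
	return NULL;
}
```

In this model, dropping the model_uncache() call from model_detach() makes a
later lookup of the same phandle return the detached node from the fast path,
which matches the behavior described in the report.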

-Frank

> 
> Rob
> 


* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
  2018-12-11 16:07   ` Rob Herring
  2018-12-12 22:00     ` Frank Rowand
@ 2018-12-13  2:56     ` Michael Ellerman
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Ellerman @ 2018-12-13  2:56 UTC (permalink / raw)
  To: Rob Herring
  Cc: mwb, linuxppc-dev, Tyrel Datwyler, tlfalcon, minkim,
	Frank Rowand, devicetree, linux-kernel

Rob Herring <robh+dt@kernel.org> writes:
> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
...
>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>> index 09692c9b32a7..d8e4534c0686 100644
>> --- a/drivers/of/base.c
>> +++ b/drivers/of/base.c
>> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>>                 if (phandle_cache[masked_handle] &&
>>                     handle == phandle_cache[masked_handle]->phandle)
>>                         np = phandle_cache[masked_handle];
>> +
>> +               /* If we find a detached node, remove it */
>> +               if (of_node_check_flag(np, OF_DETACHED))
>> +                       np = phandle_cache[masked_handle] = NULL;
>
> I'm wondering if we should explicitly remove the node from the cache
> when we set OF_DETACHED. Otherwise, it could be possible that the node
> pointer has been freed already.

Yeah good point.

> Or maybe we need both?

That's probably best, it could even be a WARN_ON() if we find one in
of_find_node_by_phandle().
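
A lookup-side check along those lines might look roughly like the userspace
sketch below.  It is not the kernel implementation: sketch_node,
sketch_cache, sketch_cache_lookup and stale_hits are hypothetical names, and
a message on stderr stands in for WARN_ON().  Note that the cached pointer
can be NULL on a miss, so the sketch tests it before reading the flag.  As
Rob points out, a check like this cannot help if the node has already been
freed, so it only complements removal at detach time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SLOTS 4u                        /* assumed: must be a power of two */

/* Hypothetical stand-in for struct device_node. */
struct sketch_node {
	uint32_t phandle;
	int detached;                   /* models the OF_DETACHED flag */
};

static struct sketch_node *sketch_cache[SLOTS];
static unsigned int stale_hits;         /* counts how often the check fired */

/* Cache-only lookup: returns NULL on a miss or on a stale (detached) hit. */
static struct sketch_node *sketch_cache_lookup(uint32_t handle)
{
	uint32_t slot = handle & (SLOTS - 1u);
	struct sketch_node *np = NULL;

	if (sketch_cache[slot] && sketch_cache[slot]->phandle == handle)
		np = sketch_cache[slot];

	/* np is NULL on a miss, so test it before looking at the flag. */
	if (np && np->detached) {
		stale_hits++;
		fprintf(stderr, "stale cache entry for phandle 0x%x\n",
			(unsigned int)handle);
		sketch_cache[slot] = NULL;  /* evict so it cannot be found again */
		np = NULL;
	}
	return np;
}
```

If eviction at detach time works as intended, the warning branch should never
run, so any message from it would point at a missed removal.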

cheers

* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
  2018-12-12 22:00     ` Frank Rowand
@ 2018-12-13  2:57       ` Michael Ellerman
  2018-12-14 21:58         ` Michael Bringmann
  0 siblings, 1 reply; 8+ messages in thread
From: Michael Ellerman @ 2018-12-13  2:57 UTC (permalink / raw)
  To: Frank Rowand, Rob Herring
  Cc: mwb, linuxppc-dev, Tyrel Datwyler, tlfalcon, minkim, devicetree,
	linux-kernel

Frank Rowand <frowand.list@gmail.com> writes:
> On 12/11/18 8:07 AM, Rob Herring wrote:
>> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
...
>>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>>> index 09692c9b32a7..d8e4534c0686 100644
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>>>                 if (phandle_cache[masked_handle] &&
>>>                     handle == phandle_cache[masked_handle]->phandle)
>>>                         np = phandle_cache[masked_handle];
>>> +
>>> +               /* If we find a detached node, remove it */
>>> +               if (of_node_check_flag(np, OF_DETACHED))
>>> +                       np = phandle_cache[masked_handle] = NULL;
>
> The bug you found exposes a couple of different issues, a little bit
> deeper than the proposed fix.  I'll work on a fuller fix tonight or
> tomorrow.

OK thanks.

>> I'm wondering if we should explicitly remove the node from the cache
>> when we set OF_DETACHED. Otherwise, it could be possible that the node
>> pointer has been freed already. Or maybe we need both?
>
> Yes, it should be explicitly removed.  I may also add in a paranoia check in
> of_find_node_by_phandle().

That seems best to me.

cheers

* Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem
  2018-12-13  2:57       ` Michael Ellerman
@ 2018-12-14 21:58         ` Michael Bringmann
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Bringmann @ 2018-12-14 21:58 UTC (permalink / raw)
  To: Michael Ellerman, Frank Rowand, Rob Herring
  Cc: devicetree, tlfalcon, linux-kernel, minkim, Tyrel Datwyler, linuxppc-dev

On 12/12/2018 08:57 PM, Michael Ellerman wrote:
> Frank Rowand <frowand.list@gmail.com> writes:
>> On 12/11/18 8:07 AM, Rob Herring wrote:
>>> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
> ...
>>>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>>>> index 09692c9b32a7..d8e4534c0686 100644
>>>> --- a/drivers/of/base.c
>>>> +++ b/drivers/of/base.c
>>>> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>>>>                 if (phandle_cache[masked_handle] &&
>>>>                     handle == phandle_cache[masked_handle]->phandle)
>>>>                         np = phandle_cache[masked_handle];
>>>> +
>>>> +               /* If we find a detached node, remove it */
>>>> +               if (of_node_check_flag(np, OF_DETACHED))
>>>> +                       np = phandle_cache[masked_handle] = NULL;
>>
>> The bug you found exposes a couple of different issues, a little bit
>> deeper than the proposed fix.  I'll work on a fuller fix tonight or
>> tomorrow.
> 
> OK thanks.
> 
>>> I'm wondering if we should explicitly remove the node from the cache
>>> when we set OF_DETACHED. Otherwise, it could be possible that the node
>>> pointer has been freed already. Or maybe we need both?
>>
>> Yes, it should be explicitly removed.  I may also add in a paranoia check in
>> of_find_node_by_phandle().
> 
> That seems best to me.

I agree that we should do both.

> 
> cheers

Michael

-- 
Michael W. Bringmann
Linux I/O, Networking and Security Development
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:       (512) 466-0650
mwb@linux.vnet.ibm.com


end of thread, other threads:[~2018-12-14 21:58 UTC | newest]

Thread overview: 8+ messages
     [not found] <a2d46622-a957-dffe-04d1-8087bbf0f8b5@linux.vnet.ibm.com>
2018-12-11 13:29 ` [PATCH v03] powerpc/mobility: Fix node detach/rename problem Michael Ellerman
2018-12-11 15:26   ` Michael Bringmann
2018-12-11 16:07   ` Rob Herring
2018-12-12 22:00     ` Frank Rowand
2018-12-13  2:57       ` Michael Ellerman
2018-12-14 21:58         ` Michael Bringmann
2018-12-13  2:56     ` Michael Ellerman
2018-12-11 16:43   ` Michael Bringmann
