From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sudeep Holla
Subject: Re: [PATCH v9 00/12] Support PPTT for ARM64
Date: Tue, 29 May 2018 18:31:33 +0100
Message-ID:
References: <20180511235807.30834-1-jeremy.linton@arm.com> <20180517170523.h7tuvbzdfluuidcz@armageddon.cambridge.arm.com> <09fb3fe7-d703-43f1-74f7-f8cb5ff1f67a@arm.com> <551905a6-eaa8-97df-06ec-1ceedfbc164f@arm.com> <20180529150823.GD17159@arm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Return-path:
In-Reply-To:
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
To: Robin Murphy, Geert Uytterhoeven, Will Deacon
Cc: Sudeep Holla, Mark Rutland, austinwc@codeaurora.org, tnowicki@caviumnetworks.com, Catalin Marinas, Palmer Dabbelt, linux-riscv@lists.infradead.org, wangxiongfeng2@huawei.com, vkilari@codeaurora.org, Lorenzo Pieralisi, jhugo@codeaurora.org, Morten.Rasmussen@arm.com, ACPI Devel Maling List, Len Brown, John Garry, Al Stone, Linux ARM, Ard Biesheuvel, Greg KH, "Rafael J. Wysocki", Linux Kernel Mailing List
List-Id: linux-acpi@vger.kernel.org

On 29/05/18 18:08, Robin Murphy wrote:
> On 29/05/18 16:51, Geert Uytterhoeven wrote:
>> Hi Will,
>>
>> On Tue, May 29, 2018 at 5:08 PM, Will Deacon wrote:
>>> On Tue, May 29, 2018 at 02:18:40PM +0100, Sudeep Holla wrote:
>>>> On 29/05/18 12:56, Geert Uytterhoeven wrote:
>>>>> On Tue, May 29, 2018 at 1:14 PM, Sudeep Holla wrote:
>>>>>> On 29/05/18 11:48, Geert Uytterhoeven wrote:
>>>>>>> System suspend still works fine on systems with big cores only:
>>>>>>>
>>>>>>>      R-Car H3 ES1.0 (4xCA57 (4xCA53 disabled in firmware))
>>>>>>>      R-Car M3-N (2xCA57)
>>>>>>>
>>>>>>> Reverting this commit fixes the issue for me.
>>>>>>
>>>>>> I can't find anything that relates to system suspend in these patches
>>>>>> unless they are messing with something during CPU hot plug-in back
>>>>>> during resume.
>>>>>
>>>>> It's only the last patch that introduces the breakage.
>>>>>
>>>>
>>>> As specified in the commit log, it won't change any behavior for DT
>>>> systems if it's non-NUMA or single node system. So I am still wondering
>>>> what could trigger this regression.
>>>
>>> I wonder if we're somehow giving an uninitialised/invalid NUMA
>>> configuration to the scheduler, although I can't see how this would happen.
>>>
>>> Geert -- if you enable CONFIG_DEBUG_PER_CPU_MAPS=y and apply the diff
>>> below do you see anything shouting in dmesg?
>>
>> Thanks, but unfortunately it doesn't help.
>> I added some debug code to print cpumask, but so far I don't see anything
>> suspicious.
>
> Do you have CONFIG_NUMA enabled? On a hunch I've managed to reproduce
> what looks like the same thing on a Juno board with NUMA=n; going in
> with external debug it seems to be stuck in the loop in
> init_sched_groups_capacity(), with an approximate stack trace of:
>
> init_sched_groups_capacity()
> partition_sched_domains()
> cpuset_cpu_active()
> sched_cpu_activate()
> cpuhp_invoke_callback()
> cpuhp_thread_fn()
>
> My hunch is based on the fact that it looks like we can, under the right
> circumstances, end up with default_topology picking up cpu_online_mask
> as a sibling mask via cpu_coregroup_mask(), and given the great
> coincidence that that's going to change when hotplugging out CPUs on
> suspend, things might not react too well to that. Things also look to go
> utterly haywire once into a full-blown systemd userspace with cpuidle,
> but I haven't got a clear picture of that yet.
>

Yes, I observed the same. I was able to suspend and resume with cpuidle disabled.
-- 
Regards,
Sudeep