linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
@ 2021-11-25 17:47 Georgi Djakov
  2022-04-05 23:00 ` Stephen Boyd
  0 siblings, 1 reply; 10+ messages in thread
From: Georgi Djakov @ 2021-11-25 17:47 UTC (permalink / raw)
  To: djakov, quic_mdtipton, okukatla
  Cc: linux-arm-msm, linux-pm, linux-kernel, Mike Tipton

From: Mike Tipton <mdtipton@codeaurora.org>

We're only adding BCMs to the commit list in aggregate(), but there are
cases where pre_aggregate() is called without subsequently calling
aggregate(). In particular, in icc_sync_state() when a node with initial
BW has zero requests. Since BCMs aren't added to the commit list in
these cases, we don't actually send the zero BW request to HW. So the
resources remain on unnecessarily.

Add BCMs to the commit list in pre_aggregate() instead, which is always
called even when there are no requests.

Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
[georgi: remove icc_sync_state for platforms with incomplete support]
Signed-off-by: Georgi Djakov <djakov@kernel.org>
---
v3:
* This patch lowers the power consumption of the boards that have
  (almost) full interconnect support enabled in their client drivers.
  This might cause however issues on platforms with very minimal
  interconnect support that have the sync_state callback set.
  We remove the sync_state callback for these platforms to prevent
  them from turning off any interconnects that have no clients yet.

v2: https://lore.kernel.org/lkml/20210721175432.2119-5-mdtipton@codeaurora.org/

 drivers/interconnect/qcom/icc-rpmh.c | 10 +++++-----
 drivers/interconnect/qcom/sm8150.c   |  1 -
 drivers/interconnect/qcom/sm8250.c   |  1 -
 drivers/interconnect/qcom/sm8350.c   |  1 -
 4 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/interconnect/qcom/icc-rpmh.c b/drivers/interconnect/qcom/icc-rpmh.c
index 3eb7936d2cf6..2c8e12549804 100644
--- a/drivers/interconnect/qcom/icc-rpmh.c
+++ b/drivers/interconnect/qcom/icc-rpmh.c
@@ -21,13 +21,18 @@ void qcom_icc_pre_aggregate(struct icc_node *node)
 {
 	size_t i;
 	struct qcom_icc_node *qn;
+	struct qcom_icc_provider *qp;
 
 	qn = node->data;
+	qp = to_qcom_provider(node->provider);
 
 	for (i = 0; i < QCOM_ICC_NUM_BUCKETS; i++) {
 		qn->sum_avg[i] = 0;
 		qn->max_peak[i] = 0;
 	}
+
+	for (i = 0; i < qn->num_bcms; i++)
+		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
 }
 EXPORT_SYMBOL_GPL(qcom_icc_pre_aggregate);
 
@@ -45,10 +50,8 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
 {
 	size_t i;
 	struct qcom_icc_node *qn;
-	struct qcom_icc_provider *qp;
 
 	qn = node->data;
-	qp = to_qcom_provider(node->provider);
 
 	if (!tag)
 		tag = QCOM_ICC_TAG_ALWAYS;
@@ -68,9 +71,6 @@ int qcom_icc_aggregate(struct icc_node *node, u32 tag, u32 avg_bw,
 	*agg_avg += avg_bw;
 	*agg_peak = max_t(u32, *agg_peak, peak_bw);
 
-	for (i = 0; i < qn->num_bcms; i++)
-		qcom_icc_bcm_voter_add(qp->voter, qn->bcms[i]);
-
 	return 0;
 }
 EXPORT_SYMBOL_GPL(qcom_icc_aggregate);
diff --git a/drivers/interconnect/qcom/sm8150.c b/drivers/interconnect/qcom/sm8150.c
index 2a85f53802b5..745e3c36a61a 100644
--- a/drivers/interconnect/qcom/sm8150.c
+++ b/drivers/interconnect/qcom/sm8150.c
@@ -535,7 +535,6 @@ static struct platform_driver qnoc_driver = {
 	.driver = {
 		.name = "qnoc-sm8150",
 		.of_match_table = qnoc_of_match,
-		.sync_state = icc_sync_state,
 	},
 };
 module_platform_driver(qnoc_driver);
diff --git a/drivers/interconnect/qcom/sm8250.c b/drivers/interconnect/qcom/sm8250.c
index 8dfb5dea562a..aa707582ea01 100644
--- a/drivers/interconnect/qcom/sm8250.c
+++ b/drivers/interconnect/qcom/sm8250.c
@@ -551,7 +551,6 @@ static struct platform_driver qnoc_driver = {
 	.driver = {
 		.name = "qnoc-sm8250",
 		.of_match_table = qnoc_of_match,
-		.sync_state = icc_sync_state,
 	},
 };
 module_platform_driver(qnoc_driver);
diff --git a/drivers/interconnect/qcom/sm8350.c b/drivers/interconnect/qcom/sm8350.c
index 3e26a2175b28..c79f93a1ac73 100644
--- a/drivers/interconnect/qcom/sm8350.c
+++ b/drivers/interconnect/qcom/sm8350.c
@@ -531,7 +531,6 @@ static struct platform_driver qnoc_driver = {
 	.driver = {
 		.name = "qnoc-sm8350",
 		.of_match_table = qnoc_of_match,
-		.sync_state = icc_sync_state,
 	},
 };
 module_platform_driver(qnoc_driver);

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2021-11-25 17:47 [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate Georgi Djakov
@ 2022-04-05 23:00 ` Stephen Boyd
  2022-04-06  4:47   ` Stephen Boyd
  2022-04-11 15:59   ` Alex Elder
  0 siblings, 2 replies; 10+ messages in thread
From: Stephen Boyd @ 2022-04-05 23:00 UTC (permalink / raw)
  To: djakov, okukatla, quic_mdtipton, Alex Elder
  Cc: linux-arm-msm, linux-pm, linux-kernel, Mike Tipton, mka, dianders

Quoting Georgi Djakov (2021-11-25 09:47:51)
> From: Mike Tipton <mdtipton@codeaurora.org>
>
> We're only adding BCMs to the commit list in aggregate(), but there are
> cases where pre_aggregate() is called without subsequently calling
> aggregate(). In particular, in icc_sync_state() when a node with initial
> BW has zero requests. Since BCMs aren't added to the commit list in
> these cases, we don't actually send the zero BW request to HW. So the
> resources remain on unnecessarily.
>
> Add BCMs to the commit list in pre_aggregate() instead, which is always
> called even when there are no requests.
>
> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
> [georgi: remove icc_sync_state for platforms with incomplete support]
> Signed-off-by: Georgi Djakov <djakov@kernel.org>

This patch fixes suspend/resume for me on sc7180-trogdor-lazor. Without
it I can't achieve XO shutdown. It seems that it fixes the sync_state
support that was added in commit b1d681d8d324 ("interconnect: Add sync
state support"). Before that commit suspend worked because the
interconnect wasn't maxed out at boot. After that commit we started
maxing out the interconnect state and never dropping it.

It would be good to pick this back to stable kernels so we have a
working suspend/resume on LTS kernels. I tried picking it back to
5.10.109 (latest 5.10 LTS) and booting it on my Lazor w/ LTE device but
it crashes at boot pretty reliably in the IPA driver. Interestingly I
can't get it to crash on 5.15.32 when I pick it back, so maybe something
has changed between 5.10 and 5.15 for IPA? I'll try to bisect it.

[   23.708432] Internal error: synchronous external abort: 96000010
[#1] PREEMPT SMP
[   23.708451] Modules linked in: veth rfcomm algif_hash
algif_skcipher af_alg uinput xt_MASQUERADE uvcvideo videobuf2_vmalloc
venus_enc venus_dec videobuf2_dma_sg videobuf2_memops venus_core
v4l2_mem2mem videobuf2_v4l2 cros_ec_typec videobuf2_common hci_uart
typec btqca qcom_q6v5_mss ipa qcom_pil_info qcom_q6v5 qcom_common
rmtfs_mem ip6table_nat fuse 8021q bluetooth ecdh_generic ecc
ath10k_snoc ath10k_core ath lzo_rle lzo_compress mac80211 zram
cfg80211 r8152 mii joydev
[   23.708565] CPU: 5 PID: 3706 Comm: mmdata_mgr Not tainted 5.10.109+ #61
[   23.708571] Hardware name: Google Lazor (rev1+) with LTE (DT)
[   23.708578] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[   23.708597] pc : gsi_channel_start+0x78/0x1dc [ipa]
[   23.708609] lr : gsi_channel_start+0x4c/0x1dc [ipa]
[   23.708614] sp : ffffffc013d9ba20
[   23.708619] x29: ffffffc013d9ba20 x28: 0000000000000000
[   23.708628] x27: 0000000000000000 x26: ffffffc013d9bc20
[   23.708637] x25: 000000000001c000 x24: 0000000000000000
[   23.708646] x23: ffffffab00cb9410 x22: 00000000712dcf80
[   23.708654] x21: ffffffab486bc148 x20: ffffffab486b8a18
[   23.708663] x19: ffffffab486b8000 x18: 00000000ffff0a00
[   23.708671] x17: 000000002f7254f1 x16: ffffffeb3db6f344
[   23.708680] x15: 00000000ffee6094 x14: ffffffffffffffff
[   23.708689] x13: 0000000000000010 x12: 0101010101010101
[   23.708697] x11: 000000000000013c x10: 0000000000000000
[   23.708706] x9 : 000000000001c000 x8 : ffffffc018f1c000
[   23.708715] x7 : fefefefefeff2f60 x6 : 0000808080808080
[   23.708724] x5 : 0000000000000000 x4 : 8080808080800000
[   23.708732] x3 : 0000000000000000 x2 : ffffffab5089eac0
[   23.708741] x1 : 0000000000000000 x0 : 0000000000000000
[   23.708750] Call trace:
[   23.708762]  gsi_channel_start+0x78/0x1dc [ipa]
[   23.708773]  ipa_endpoint_enable_one+0x34/0xc0 [ipa]
[   23.708785]  ipa_open+0x30/0x98 [ipa]
[   23.708795]  __dev_open+0xd8/0x190
[   23.708803]  __dev_change_flags+0xbc/0x1c8
[   23.708810]  dev_change_flags+0x30/0x70
[   23.708818]  devinet_ioctl+0x274/0x500
[   23.708824]  inet_ioctl+0x10c/0x394
[   23.708831]  sock_do_ioctl+0x58/0x324
[   23.708837]  compat_sock_ioctl+0x238/0xdb0
[   23.708845]  __arm64_compat_sys_ioctl+0xcc/0x104
[   23.708854]  el0_svc_common+0xec/0x1dc
[   23.708860]  do_el0_svc_compat+0x28/0x54
[   23.708868]  el0_svc_compat+0x10/0x1c
[   23.708874]  el0_sync_compat_handler+0xc0/0xf0
[   23.708880]  el0_sync_compat+0x184/0x1c0
[   23.708890] Code: 51286129 53037d29 1b166529 8b090108 (b9400108)

Note I had to pick a handful of other patches for nvmem to get normal
boot on 5.10.109. I'll send those over to stable maintainers shortly.

	fd3bb8f54a88 ("nvmem: core: Add support for keepout regions")
	de0534df9347 ("nvmem: core: fix error handling while validating
keepout regions")
	044ee8f85267 ("nvmem: qfprom: Don't touch certain fuses")
	437145dbcdee ("arm64: dts: qcom: sc7180: Add soc-specific qfprom
compat string")
	437cdef515e2 ("arm64: dts: qcom: sc7180:: modified qfprom CORR size
as per RAW size")

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-05 23:00 ` Stephen Boyd
@ 2022-04-06  4:47   ` Stephen Boyd
  2022-04-08 18:52     ` Stephen Boyd
  2022-04-11 15:59   ` Alex Elder
  1 sibling, 1 reply; 10+ messages in thread
From: Stephen Boyd @ 2022-04-06  4:47 UTC (permalink / raw)
  To: Alex Elder, djakov, okukatla, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, mka, dianders

Quoting Stephen Boyd (2022-04-05 16:00:55)
> Quoting Georgi Djakov (2021-11-25 09:47:51)
> > From: Mike Tipton <mdtipton@codeaurora.org>
> >
> > We're only adding BCMs to the commit list in aggregate(), but there are
> > cases where pre_aggregate() is called without subsequently calling
> > aggregate(). In particular, in icc_sync_state() when a node with initial
> > BW has zero requests. Since BCMs aren't added to the commit list in
> > these cases, we don't actually send the zero BW request to HW. So the
> > resources remain on unnecessarily.
> >
> > Add BCMs to the commit list in pre_aggregate() instead, which is always
> > called even when there are no requests.
> >
> > Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
> > [georgi: remove icc_sync_state for platforms with incomplete support]
> > Signed-off-by: Georgi Djakov <djakov@kernel.org>
>
> This patch fixes suspend/resume for me on sc7180-trogdor-lazor. Without
> it I can't achieve XO shutdown. It seems that it fixes the sync_state
> support that was added in commit b1d681d8d324 ("interconnect: Add sync
> state support"). Before that commit suspend worked because the
> interconnect wasn't maxed out at boot. After that commit we started
> maxing out the interconnect state and never dropping it.

I'm also wondering if this means suspend doesn't work without sync_state
support? Does this mean that device links are required? And device links
are only made if fw_devlink is enabled? I don't see any devlinks made in
drivers/interconnect so I worry that we have to use fw_devlink to get
device links made to make sync_state happen to remove the max votes that
are put in at boot.

>
> It would be good to pick this back to stable kernels so we have a
> working suspend/resume on LTS kernels. I tried picking it back to
> 5.10.109 (latest 5.10 LTS) and booting it on my Lazor w/ LTE device but
> it crashes at boot pretty reliably in the IPA driver. Interestingly I
> can't get it to crash on 5.15.32 when I pick it back, so maybe something
> has changed between 5.10 and 5.15 for IPA? I'll try to bisect it.

Bisecting pointed to commit 1aac309d3207 ("net: ipa: use autosuspend")
as fixing it. I think before that commit we weren't enabling some
interconnect, but now we're booting, runtime suspending, and then
runtime resuming again. With the sync state patch I suspect the
interconnect bandwidth is dropped and IPA needs to use runtime PM to
actually turn resources back on because it assumed that resources are on
when it probes.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-06  4:47   ` Stephen Boyd
@ 2022-04-08 18:52     ` Stephen Boyd
  0 siblings, 0 replies; 10+ messages in thread
From: Stephen Boyd @ 2022-04-08 18:52 UTC (permalink / raw)
  To: Alex Elder, djakov, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, mka, dianders

Quoting Stephen Boyd (2022-04-05 21:47:07)
> Quoting Stephen Boyd (2022-04-05 16:00:55)
> >
> > It would be good to pick this back to stable kernels so we have a
> > working suspend/resume on LTS kernels. I tried picking it back to
> > 5.10.109 (latest 5.10 LTS) and booting it on my Lazor w/ LTE device but
> > it crashes at boot pretty reliably in the IPA driver. Interestingly I
> > can't get it to crash on 5.15.32 when I pick it back, so maybe something
> > has changed between 5.10 and 5.15 for IPA? I'll try to bisect it.
>
> Bisecting pointed to commit 1aac309d3207 ("net: ipa: use autosuspend")
> as fixing it. I think before that commit we weren't enabling some
> interconnect, but now we're booting, runtime suspending, and then
> runtime resuming again. With the sync state patch I suspect the
> interconnect bandwidth is dropped and IPA needs to use runtime PM to
> actually turn resources back on because it assumed that resources are on
> when it probes.

I also found that when I make CONFIG_QCOM_IPA=y (and subsequently
CONFIG_QCOM_Q6V5_MSS=y) I can reproduce a different crash in IPA on
5.15.32 and 5.17.1 LTS kernels. I suppose there is some missing
interconnect bandwidth request somewhere and the runtime PM patch half
fixed it, except for when the modem and IPA drivers are builtin. When
the two drivers are builtin they drop bandwidth requests earlier because
they probe earlier. My guess is that the IPA driver is missing a
runtime_pm_get_sync() call somewhere and accessing a register that isn't
clocked. More digging...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-05 23:00 ` Stephen Boyd
  2022-04-06  4:47   ` Stephen Boyd
@ 2022-04-11 15:59   ` Alex Elder
  2022-04-11 19:06     ` Stephen Boyd
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Elder @ 2022-04-11 15:59 UTC (permalink / raw)
  To: Stephen Boyd, djakov, okukatla, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, Mike Tipton, mka, dianders

On 4/5/22 6:00 PM, Stephen Boyd wrote:
> Quoting Georgi Djakov (2021-11-25 09:47:51)
>> From: Mike Tipton <mdtipton@codeaurora.org>
>>
>> We're only adding BCMs to the commit list in aggregate(), but there are
>> cases where pre_aggregate() is called without subsequently calling
>> aggregate(). In particular, in icc_sync_state() when a node with initial
>> BW has zero requests. Since BCMs aren't added to the commit list in
>> these cases, we don't actually send the zero BW request to HW. So the
>> resources remain on unnecessarily.
>>
>> Add BCMs to the commit list in pre_aggregate() instead, which is always
>> called even when there are no requests.
>>
>> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
>> [georgi: remove icc_sync_state for platforms with incomplete support]
>> Signed-off-by: Georgi Djakov <djakov@kernel.org>

I'm back from vacation and am finally giving proper attention to
this.  I want to make sure I understand the problem, because there
are (at least) two parts to it.

- The first problem you observe is that you are not seeing XO
   shutdown on suspend on a Lazor device.
- You didn't say this directly but I think you are seeing this
   on Linux v5.15.y (the 5.15 LTS branch), or perhaps on something
   derived from that branch.
- You find that if you back-port (or cherry-pick?) the commit
   that landed upstream as b95b668eaaa2 ("interconnect: qcom:
   icc-rpmh: Add BCMs to commit list in pre_aggregate
"), you
   *do* see XO shutdown on suspend, as desired.

Here's what I understand that commit to do:
- In some cases, the bus clock managers (BCMs) are configured
   by the boot loader so that some interconnects have non-zero
   initial bandwidth.
- There is no sense in keeping an interconnect active if Linux
   has nothing that requires its use.  So we would like Linux to
   ensure the configured bandwidth for an *unused* interconnect
   is zero.
- Prior to that commit, BCM-managed hardware was only queued
   to update its configuration when the ->aggregate interconnect
   provider function was called.  After that commit, updates were
   queued by the ->pre_aggregate provider function.
- Unlike the ->aggregate callback, the ->pre_aggregate provider
   function queues updates to the hardware configuration whether
   or not they have active users.
- The result of this commit is that the hardware configuration
   for all defined BCM-managed interconnects is updated, and in
   particular, the configured bandwidth for unused interconnects
   is set to zero.

When unused interconnects are configured for zero bandwidth, they
do not require an active main XO clock, and so with this commit
it becomes possible for the XO clock to be shut down.

And that's why this commit addresses your XO shutdown problem on
the Linux 5.15 LTS branch.

Is the above an accurate description?

Looking at that branch, I see this commit:  f753067494c27
("Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit
list in pre_aggregate"
").  Which shows that an attempt was made
to include this commit in the 5.15 LTS branch, but it caused
some *other* regressions.  That suggests this might not be
easy to fix.

---

The second problem you have is exhibited by the IPA driver if
the "fix" commit (upstream b95b668eaaa2) is back-ported to the
Linux 5.10.y LTS branch (along with some other prerequisite
commits).  We can conclude that applying the above commit
makes the bandwidth for an unused interconnect (or perhaps
the rate for the IPA core clock) get set to zero.  And in that
case, an attempt to access IPA hardware leads to the crash you
observed.

The IPA driver does not implement runtime power management
until Linux v5.15.  You later said you thought enabling that
might ensure the clock and interconnects were active when
needed by the IPA driver, and I concur (but there could be a
little more to it).

In any case, based on the time stamp in your log, it seems
this problem is likely occurring upon the first access to IPA
hardware.

I have a hunch about what might be happening here.  There is
some synchronization that must occur between the AP and modem
when IPA is starting up.  Until that synchronization step has
completed, we can't allow the IPA network device to be opened.
In later kernels I think this is precluded, but perhaps in
Linux v5.10 it isn't.  Until I look a little more closely I'm
not sure what would happen, but it *could* be this.

I'm going to look a little how the particular access that
caused the crash is prevented in newer kernels.  It could
be that back-porting that (or re-implementing it for the
older kernel) will address the crash you're seeing.

					-Alex

> This patch fixes suspend/resume for me on sc7180-trogdor-lazor. Without
> it I can't achieve XO shutdown. It seems that it fixes the sync_state
> support that was added in commit b1d681d8d324 ("interconnect: Add sync
> state support"). Before that commit suspend worked because the
> interconnect wasn't maxed out at boot. After that commit we started
> maxing out the interconnect state and never dropping it.
> 
> It would be good to pick this back to stable kernels so we have a
> working suspend/resume on LTS kernels. I tried picking it back to
> 5.10.109 (latest 5.10 LTS) and booting it on my Lazor w/ LTE device but
> it crashes at boot pretty reliably in the IPA driver. Interestingly I
> can't get it to crash on 5.15.32 when I pick it back, so maybe something
> has changed between 5.10 and 5.15 for IPA? I'll try to bisect it.
> 
> [   23.708432] Internal error: synchronous external abort: 96000010
> [#1] PREEMPT SMP
> [   23.708451] Modules linked in: veth rfcomm algif_hash
> algif_skcipher af_alg uinput xt_MASQUERADE uvcvideo videobuf2_vmalloc
> venus_enc venus_dec videobuf2_dma_sg videobuf2_memops venus_core
> v4l2_mem2mem videobuf2_v4l2 cros_ec_typec videobuf2_common hci_uart
> typec btqca qcom_q6v5_mss ipa qcom_pil_info qcom_q6v5 qcom_common
> rmtfs_mem ip6table_nat fuse 8021q bluetooth ecdh_generic ecc
> ath10k_snoc ath10k_core ath lzo_rle lzo_compress mac80211 zram
> cfg80211 r8152 mii joydev
> [   23.708565] CPU: 5 PID: 3706 Comm: mmdata_mgr Not tainted 5.10.109+ #61
> [   23.708571] Hardware name: Google Lazor (rev1+) with LTE (DT)
> [   23.708578] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> [   23.708597] pc : gsi_channel_start+0x78/0x1dc [ipa]
> [   23.708609] lr : gsi_channel_start+0x4c/0x1dc [ipa]
> [   23.708614] sp : ffffffc013d9ba20
> [   23.708619] x29: ffffffc013d9ba20 x28: 0000000000000000
> [   23.708628] x27: 0000000000000000 x26: ffffffc013d9bc20
> [   23.708637] x25: 000000000001c000 x24: 0000000000000000
> [   23.708646] x23: ffffffab00cb9410 x22: 00000000712dcf80
> [   23.708654] x21: ffffffab486bc148 x20: ffffffab486b8a18
> [   23.708663] x19: ffffffab486b8000 x18: 00000000ffff0a00
> [   23.708671] x17: 000000002f7254f1 x16: ffffffeb3db6f344
> [   23.708680] x15: 00000000ffee6094 x14: ffffffffffffffff
> [   23.708689] x13: 0000000000000010 x12: 0101010101010101
> [   23.708697] x11: 000000000000013c x10: 0000000000000000
> [   23.708706] x9 : 000000000001c000 x8 : ffffffc018f1c000
> [   23.708715] x7 : fefefefefeff2f60 x6 : 0000808080808080
> [   23.708724] x5 : 0000000000000000 x4 : 8080808080800000
> [   23.708732] x3 : 0000000000000000 x2 : ffffffab5089eac0
> [   23.708741] x1 : 0000000000000000 x0 : 0000000000000000
> [   23.708750] Call trace:
> [   23.708762]  gsi_channel_start+0x78/0x1dc [ipa]
> [   23.708773]  ipa_endpoint_enable_one+0x34/0xc0 [ipa]
> [   23.708785]  ipa_open+0x30/0x98 [ipa]
> [   23.708795]  __dev_open+0xd8/0x190
> [   23.708803]  __dev_change_flags+0xbc/0x1c8
> [   23.708810]  dev_change_flags+0x30/0x70
> [   23.708818]  devinet_ioctl+0x274/0x500
> [   23.708824]  inet_ioctl+0x10c/0x394
> [   23.708831]  sock_do_ioctl+0x58/0x324
> [   23.708837]  compat_sock_ioctl+0x238/0xdb0
> [   23.708845]  __arm64_compat_sys_ioctl+0xcc/0x104
> [   23.708854]  el0_svc_common+0xec/0x1dc
> [   23.708860]  do_el0_svc_compat+0x28/0x54
> [   23.708868]  el0_svc_compat+0x10/0x1c
> [   23.708874]  el0_sync_compat_handler+0xc0/0xf0
> [   23.708880]  el0_sync_compat+0x184/0x1c0
> [   23.708890] Code: 51286129 53037d29 1b166529 8b090108 (b9400108)
> 
> Note I had to pick a handful of other patches for nvmem to get normal
> boot on 5.10.109. I'll send those over to stable maintainers shortly.
> 
> 	fd3bb8f54a88 ("nvmem: core: Add support for keepout regions")
> 	de0534df9347 ("nvmem: core: fix error handling while validating
> keepout regions")
> 	044ee8f85267 ("nvmem: qfprom: Don't touch certain fuses")
> 	437145dbcdee ("arm64: dts: qcom: sc7180: Add soc-specific qfprom
> compat string")
> 	437cdef515e2 ("arm64: dts: qcom: sc7180:: modified qfprom CORR size
> as per RAW size")


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-11 15:59   ` Alex Elder
@ 2022-04-11 19:06     ` Stephen Boyd
  2022-04-11 20:20       ` Stephen Boyd
                         ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Stephen Boyd @ 2022-04-11 19:06 UTC (permalink / raw)
  To: Alex Elder, djakov, okukatla, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, Mike Tipton, mka, dianders

Quoting Alex Elder (2022-04-11 08:59:07)
> On 4/5/22 6:00 PM, Stephen Boyd wrote:
> > Quoting Georgi Djakov (2021-11-25 09:47:51)
> >> From: Mike Tipton <mdtipton@codeaurora.org>
> >>
> >> We're only adding BCMs to the commit list in aggregate(), but there are
> >> cases where pre_aggregate() is called without subsequently calling
> >> aggregate(). In particular, in icc_sync_state() when a node with initial
> >> BW has zero requests. Since BCMs aren't added to the commit list in
> >> these cases, we don't actually send the zero BW request to HW. So the
> >> resources remain on unnecessarily.
> >>
> >> Add BCMs to the commit list in pre_aggregate() instead, which is always
> >> called even when there are no requests.
> >>
> >> Signed-off-by: Mike Tipton <mdtipton@codeaurora.org>
> >> [georgi: remove icc_sync_state for platforms with incomplete support]
> >> Signed-off-by: Georgi Djakov <djakov@kernel.org>
>
> I'm back from vacation and am finally giving proper attention to
> this.  I want to make sure I understand the problem, because there
> are (at least) two parts to it.
>
> - The first problem you observe is that you are not seeing XO
>    shutdown on suspend on a Lazor device.
> - You didn't say this directly but I think you are seeing this
>    on Linux v5.15.y (the 5.15 LTS branch), or perhaps on something
>    derived from that branch.

Yes.

> - You find that if you back-port (or cherry-pick?) the commit
>    that landed upstream as b95b668eaaa2 ("interconnect: qcom:
>    icc-rpmh: Add BCMs to commit list in pre_aggregate
> "), you
>    *do* see XO shutdown on suspend, as desired.

Correct.

>
> Here's what I understand that commit to do:
> - In some cases, the bus clock managers (BCMs) are configured
>    by the boot loader so that some interconnects have non-zero
>    initial bandwidth.
> - There is no sense in keeping an interconnect active if Linux
>    has nothing that requires its use.  So we would like Linux to
>    ensure the configured bandwidth for an *unused* interconnect
>    is zero.
> - Prior to that commit, BCM-managed hardware was only queued
>    to update its configuration when the ->aggregate interconnect
>    provider function was called.  After that commit, updates were
>    queued by the ->pre_aggregate provider function.

Also before that commit interconnects are maxed out, which doesn't
really matter for XO shutdown but it means that we're running faster
than what the bootloader configures if boot is slower.

> - Unlike the ->aggregate callback, the ->pre_aggregate provider
>    function queues updates to the hardware configuration whether
>    or not they have active users.
> - The result of this commit is that the hardware configuration
>    for all defined BCM-managed interconnects is updated, and in
>    particular, the configured bandwidth for unused interconnects
>    is set to zero.

Yep.

>
> When unused interconnects are configured for zero bandwidth, they
> do not require an active main XO clock, and so with this commit
> it becomes possible for the XO clock to be shut down.
>
> And that's why this commit addresses your XO shutdown problem on
> the Linux 5.15 LTS branch.
>
> Is the above an accurate description?

Yeah pretty much. Without the interconnect patch I can't get XO
shutdown.

>
> Looking at that branch, I see this commit:  f753067494c27
> ("Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit
> list in pre_aggregate"
> ").  Which shows that an attempt was made
> to include this commit in the 5.15 LTS branch, but it caused
> some *other* regressions.  That suggests this might not be
> easy to fix.

Indeed. The commit was reverted because it broke reboot for me. I see it
was reintroduced a few months later though when I was eating
Thanksgiving dinner and I didn't notice until now. Interestingly the
reboot issue is gone. Here's the crash from back then.

 SError Interrupt on CPU6, code 0xbe000411 -- SError
 CPU: 6 PID: 8772 Comm: reboot Not tainted 5.14.0-rc5-next-20210810+ #1
 Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
 pstate: 004000c9 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : el1_interrupt+0x20/0x60
 lr : el1h_64_irq_handler+0x18/0x24
 sp : ffffffc0114139c0
 x29: ffffffc0114139c0 x28: ffffff808a8b2240 x27: 0000000000000000
 x26: ffffff80817ec018 x25: ffffffd79e8f0000 x24: ffffffd79e957000
 x23: 0000000000400009 x22: ffffffd79dac527c x21: ffffffc011413b40
 x20: ffffffd79d6100f8 x19: ffffffc0114139f0 x18: 0000000000022a07
 x17: 0000000000000000 x16: ffffffd79dac52e4 x15: ffffff80d291fe80
 x14: 0000000000000580 x13: 000000000000300c x12: ffffff80b4f7ed10
 x11: 0000000000000003 x10: 00000000c0000000 x9 : 0000000000000003
 x8 : 00000000000000c0 x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000001
 x5 : 0000000000170006 x4 : ffffff80d291bcc0 x3 : ffffffd79e429b41
 x2 : 0000000000000002 x1 : ffffffd79d6100f8 x0 : ffffffc0114139f0
 Kernel panic - not syncing: Asynchronous SError Interrupt
 CPU: 6 PID: 8772 Comm: reboot Not tainted 5.14.0-rc5-next-20210810+ #1
 Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
 Call trace:
  dump_backtrace+0x0/0x1d4
  show_stack+0x24/0x30
  dump_stack_lvl+0x64/0x7c
  dump_stack+0x18/0x38
  panic+0x150/0x38c
  nmi_panic+0x88/0xa0
  arm64_serror_panic+0x74/0x80
  is_valid_bugaddr+0x0/0x1c
  el1h_64_error_handler+0x30/0x48
  el1h_64_error+0x78/0x7c
  el1_interrupt+0x20/0x60
  el1h_64_irq_handler+0x18/0x24
  el1h_64_irq+0x78/0x7c
  refcount_dec_not_one+0x48/0xb0
  refcount_dec_and_mutex_lock+0x1c/0xb4
  ipa_clock_put+0x34/0x74 [ipa]
  ipa_uc_deconfig+0x4c/0x5c [ipa]
  ipa_deconfig+0x30/0x90 [ipa]
  ipa_remove+0xbc/0x11c [ipa]
  platform_shutdown+0x30/0x3c
  device_shutdown+0x150/0x208
  kernel_restart_prepare+0x44/0x50

>
> ---
>
> The second problem you have is exhibited by the IPA driver if
> the "fix" commit (upstream b95b668eaaa2) is back-ported to the
> Linux 5.10.y LTS branch (along with some other prerequisite
> commits).  We can conclude that applying the above commit
> makes the bandwidth for an unused interconnect (or perhaps
> the rate for the IPA core clock) get set to zero.  And in that
> case, an attempt to access IPA hardware leads to the crash you
> observed.
>
> The IPA driver does not implement runtime power management
> until Linux v5.15.  You later said you thought enabling that
> might ensure the clock and interconnects were active when
> needed by the IPA driver, and I concur (but there could be a
> little more to it).

Is the runtime PM patch series necessary to enable the IPA clk and
interconnects? Things don't look good on 5.10.y and I'm not sure it will
be workable. Commit b1d681d8d324 ("interconnect: Add sync state
support") was introduced in v5.10 and that seems to be the commit that
broke suspend on Lazor.

>
> In any case, based on the time stamp in your log, it seems
> this problem is likely occurring upon the first access to IPA
> hardware.
>
> I have a hunch about what might be happening here.  There is
> some synchronization that must occur between the AP and modem
> when IPA is starting up.  Until that synchronization step has
> completed, we can't allow the IPA network device to be opened.

Is there a commit that implements this? Or how is the synchronization
done? I can debug more and see if that synchronization is happening.

> In later kernels I think this is precluded, but perhaps in
> Linux v5.10 it isn't.  Until I look a little more closely I'm
> not sure what would happen, but it *could* be this.
>
> I'm going to look a little how the particular access that
> caused the crash is prevented in newer kernels.  It could
> be that back-porting that (or re-implementing it for the
> older kernel) will address the crash you're seeing.
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-11 19:06     ` Stephen Boyd
@ 2022-04-11 20:20       ` Stephen Boyd
  2022-04-12  4:47         ` Stephen Boyd
  2022-04-12  2:18       ` Alex Elder
  2022-04-12  2:20       ` Alex Elder
  2 siblings, 1 reply; 10+ messages in thread
From: Stephen Boyd @ 2022-04-11 20:20 UTC (permalink / raw)
  To: Alex Elder, djakov, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, mka, dianders,
	Bjorn Andersson, Taniya Das, Sibi Sankar

Quoting Stephen Boyd (2022-04-11 12:06:19)
> Quoting Alex Elder (2022-04-11 08:59:07)
> > The second problem you have is exhibited by the IPA driver if
> > the "fix" commit (upstream b95b668eaaa2) is back-ported to the
> > Linux 5.10.y LTS branch (along with some other prerequisite
> > commits).  We can conclude that applying the above commit
> > makes the bandwidth for an unused interconnect (or perhaps
> > the rate for the IPA core clock) get set to zero.  And in that
> > case, an attempt to access IPA hardware leads to the crash you
> > observed.
> >
> > The IPA driver does not implement runtime power management
> > until Linux v5.15.  You later said you thought enabling that
> > might ensure the clock and interconnects were active when
> > needed by the IPA driver, and I concur (but there could be a
> > little more to it).
>
> Is the runtime PM patch series necessary to enable the IPA clk and
> interconnects? Things don't look good on 5.10.y and I'm not sure it will
> be workable. Commit b1d681d8d324 ("interconnect: Add sync state
> support") was introduced in v5.10 and that seems to be the commit that
> broke suspend on Lazor.
>
> >
> > In any case, based on the time stamp in your log, it seems
> > this problem is likely occurring upon the first access to IPA
> > hardware.
> >
> > I have a hunch about what might be happening here.  There is
> > some synchronization that must occur between the AP and modem
> > when IPA is starting up.  Until that synchronization step has
> > completed, we can't allow the IPA network device to be opened.
>
> Is there a commit that implements this? Or how is the synchronization
> done? I can debug more and see if that synchronization is happening.
>
> > In later kernels I think this is precluded, but perhaps in
> > Linux v5.10 it isn't.  Until I look a little more closely I'm
> > not sure what would happen, but it *could* be this.
> >
> > I'm going to look a little how the particular access that
> > caused the crash is prevented in newer kernels.  It could
> > be that back-porting that (or re-implementing it for the
> > older kernel) will address the crash you're seeing.
> >

I think it's some IPA unclocked access because it's an SError async
abort. I looked some at the IPA clk implementation and it is a little
concerning. I see two problems. The first problem is that clk-rpmh is
only voting on the active state for the IPA clk. Maybe RPMh will move
the sleep/wake state over from the active state automatically as long as
we never make a request in either of those states? I don't know, but it
looks concerning and either needs to be fixed or a big comment
indicating that the copy happens. This patch makes it set a frequency in
each state bucket.

----8<----
diff --git a/drivers/clk/qcom/clk-rpmh.c b/drivers/clk/qcom/clk-rpmh.c
index 74e57c84f60a..03c8c0e7db7a 100644
--- a/drivers/clk/qcom/clk-rpmh.c
+++ b/drivers/clk/qcom/clk-rpmh.c
@@ -260,6 +260,7 @@ static int clk_rpmh_bcm_send_cmd(struct clk_rpmh
*c, bool enable)
 	struct tcs_cmd cmd = { 0 };
 	u32 cmd_state;
 	int ret = 0;
+	enum rpmh_state state;

 	mutex_lock(&rpmh_clk_lock);
 	if (enable) {
@@ -274,15 +275,19 @@ static int clk_rpmh_bcm_send_cmd(struct clk_rpmh
*c, bool enable)
 		cmd.addr = c->res_addr;
 		cmd.data = BCM_TCS_CMD(1, enable, 0, cmd_state);

-		ret = clk_rpmh_send(c, RPMH_ACTIVE_ONLY_STATE, &cmd, enable);
-		if (ret) {
-			dev_err(c->dev, "set active state of %s failed: (%d)\n",
-				c->res_name, ret);
-		} else {
-			c->last_sent_aggr_state = cmd_state;
+		for (state = RPMH_SLEEP_STATE; state <= RPMH_ACTIVE_ONLY_STATE; state++) {
+			ret = clk_rpmh_send(c, state, &cmd, enable);
+			if (ret) {
+				dev_err(c->dev, "set %s state of %s failed: (%d)\n",
+					!state ? "sleep" :
+					state == RPMH_WAKE_ONLY_STATE	?
+					"wake" : "active", c->res_name, ret);
+				goto out;
+			}
 		}
+		c->last_sent_aggr_state = cmd_state;
 	}
-
+out:
 	mutex_unlock(&rpmh_clk_lock);

 	return ret;

---8<-----

The second problem I see is that the IPA clk resource is "IP0" but it is
still defined in the interconnect/qcom/sc7180.c file.

$ git grep \"IP0\" -- drivers/{interconnect,clk}/
drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdm845, ipa, "IP0");
drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdx55, ipa, "IP0");
drivers/interconnect/qcom/sc7180.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);
drivers/interconnect/qcom/sc8180x.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&slv_ipa_core_slave);
drivers/interconnect/qcom/sdx55.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);
drivers/interconnect/qcom/sm8150.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);
drivers/interconnect/qcom/sm8250.c:DEFINE_QBCM(bcm_ip0, "IP0", false,
&ipa_core_slave);

That sounds pretty bad because it means both interconnect and clk frameworks
are trying to control the same RPMh resource, trampling over each other.
Combine that with unused clks and sync_state support and I don't know
what the state of the IP0 resource really is anymore.

When the clk-rpmh driver got an IPA clk a while ago I didn't want it to
manage this BCM[1]. That's because it doesn't use anything in the clk
framework besides the clk_set_rate() interface. No clk tree management,
or recalc_rate interface, etc. It seems like it kept getting added to
the interconnect framework after sdm845 though. Confusing!

Putting the discussion about how it should be represented in the kernel
aside, I applied this patch and rebooted 30 times and didn't see a
crash. I think this fixed it. I will back out the clk-rpmh patch now and
see if it is still fixed. It would be great if someone from qcom can
tell me what's going on.

diff --git a/drivers/interconnect/qcom/sc7180.c
b/drivers/interconnect/qcom/sc7180.c
index 12d59c36df53..5f7c0f85fa8e 100644
--- a/drivers/interconnect/qcom/sc7180.c
+++ b/drivers/interconnect/qcom/sc7180.c
@@ -47,7 +47,6 @@ DEFINE_QNODE(qnm_mnoc_sf,
SC7180_MASTER_MNOC_SF_MEM_NOC, 1, 32, SC7180_SLAVE_GEM
 DEFINE_QNODE(qnm_snoc_gc, SC7180_MASTER_SNOC_GC_MEM_NOC, 1, 8,
SC7180_SLAVE_LLCC);
 DEFINE_QNODE(qnm_snoc_sf, SC7180_MASTER_SNOC_SF_MEM_NOC, 1, 16,
SC7180_SLAVE_LLCC);
 DEFINE_QNODE(qxm_gpu, SC7180_MASTER_GFX3D, 2, 32,
SC7180_SLAVE_GEM_NOC_SNOC, SC7180_SLAVE_LLCC);
-DEFINE_QNODE(ipa_core_master, SC7180_MASTER_IPA_CORE, 1, 8,
SC7180_SLAVE_IPA_CORE);
 DEFINE_QNODE(llcc_mc, SC7180_MASTER_LLCC, 2, 4, SC7180_SLAVE_EBI1);
 DEFINE_QNODE(qhm_mnoc_cfg, SC7180_MASTER_CNOC_MNOC_CFG, 1, 4,
SC7180_SLAVE_SERVICE_MNOC);
 DEFINE_QNODE(qxm_camnoc_hf0, SC7180_MASTER_CAMNOC_HF0, 2, 32,
SC7180_SLAVE_MNOC_HF_MEM_NOC);
@@ -129,7 +128,6 @@ DEFINE_QNODE(qhs_mdsp_ms_mpu_cfg,
SC7180_SLAVE_MSS_PROC_MS_MPU_CFG, 1, 4);
 DEFINE_QNODE(qns_gem_noc_snoc, SC7180_SLAVE_GEM_NOC_SNOC, 1, 8,
SC7180_MASTER_GEM_NOC_SNOC);
 DEFINE_QNODE(qns_llcc, SC7180_SLAVE_LLCC, 1, 16, SC7180_MASTER_LLCC);
 DEFINE_QNODE(srvc_gemnoc, SC7180_SLAVE_SERVICE_GEM_NOC, 1, 4);
-DEFINE_QNODE(ipa_core_slave, SC7180_SLAVE_IPA_CORE, 1, 8);
 DEFINE_QNODE(ebi, SC7180_SLAVE_EBI1, 2, 4);
 DEFINE_QNODE(qns_mem_noc_hf, SC7180_SLAVE_MNOC_HF_MEM_NOC, 1, 32,
SC7180_MASTER_MNOC_HF_MEM_NOC);
 DEFINE_QNODE(qns_mem_noc_sf, SC7180_SLAVE_MNOC_SF_MEM_NOC, 1, 32,
SC7180_MASTER_MNOC_SF_MEM_NOC);
@@ -160,7 +158,6 @@ DEFINE_QBCM(bcm_mc0, "MC0", true, &ebi);
 DEFINE_QBCM(bcm_sh0, "SH0", true, &qns_llcc);
 DEFINE_QBCM(bcm_mm0, "MM0", false, &qns_mem_noc_hf);
 DEFINE_QBCM(bcm_ce0, "CE0", false, &qxm_crypto);
-DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
 DEFINE_QBCM(bcm_cn0, "CN0", true, &qnm_snoc, &xm_qdss_dap,
&qhs_a1_noc_cfg, &qhs_a2_noc_cfg, &qhs_ahb2phy0, &qhs_aop, &qhs_aoss,
&qhs_boot_rom, &qhs_camera_cfg, &qhs_camera_nrt_throttle_cfg,
&qhs_camera_rt_throttle_cfg, &qhs_clk_ctl, &qhs_cpr_cx, &qhs_cpr_mx,
&qhs_crypto0_cfg, &qhs_dcc_cfg, &qhs_ddrss_cfg, &qhs_display_cfg,
&qhs_display_rt_throttle_cfg, &qhs_display_throttle_cfg, &qhs_glm,
&qhs_gpuss_cfg, &qhs_imem_cfg, &qhs_ipa, &qhs_mnoc_cfg, &qhs_mss_cfg,
&qhs_npu_cfg, &qhs_npu_dma_throttle_cfg, &qhs_npu_dsp_throttle_cfg,
&qhs_pimem_cfg, &qhs_prng, &qhs_qdss_cfg, &qhs_qm_cfg,
&qhs_qm_mpu_cfg, &qhs_qup0, &qhs_qup1, &qhs_security, &qhs_snoc_cfg,
&qhs_tcsr, &qhs_tlmm_1, &qhs_tlmm_2, &qhs_tlmm_3, &qhs_ufs_mem_cfg,
&qhs_usb3, &qhs_venus_cfg, &qhs_venus_throttle_cfg,
&qhs_vsense_ctrl_cfg, &srvc_cnoc);
 DEFINE_QBCM(bcm_mm1, "MM1", false, &qxm_camnoc_hf0_uncomp,
&qxm_camnoc_hf1_uncomp, &qxm_camnoc_sf_uncomp, &qhm_mnoc_cfg,
&qxm_mdp0, &qxm_rot, &qxm_venus0, &qxm_venus_arm9);
 DEFINE_QBCM(bcm_sh2, "SH2", false, &acm_sys_tcu);
@@ -372,22 +369,6 @@ static struct qcom_icc_desc sc7180_gem_noc = {
 	.num_bcms = ARRAY_SIZE(gem_noc_bcms),
 };

-static struct qcom_icc_bcm *ipa_virt_bcms[] = {
-	&bcm_ip0,
-};
-
-static struct qcom_icc_node *ipa_virt_nodes[] = {
-	[MASTER_IPA_CORE] = &ipa_core_master,
-	[SLAVE_IPA_CORE] = &ipa_core_slave,
-};
-
-static struct qcom_icc_desc sc7180_ipa_virt = {
-	.nodes = ipa_virt_nodes,
-	.num_nodes = ARRAY_SIZE(ipa_virt_nodes),
-	.bcms = ipa_virt_bcms,
-	.num_bcms = ARRAY_SIZE(ipa_virt_bcms),
-};
-
 static struct qcom_icc_bcm *mc_virt_bcms[] = {
 	&bcm_acv,
 	&bcm_mc0,
@@ -519,8 +500,6 @@ static const struct of_device_id qnoc_of_match[] = {
 	  .data = &sc7180_dc_noc},
 	{ .compatible = "qcom,sc7180-gem-noc",
 	  .data = &sc7180_gem_noc},
-	{ .compatible = "qcom,sc7180-ipa-virt",
-	  .data = &sc7180_ipa_virt},
 	{ .compatible = "qcom,sc7180-mc-virt",
 	  .data = &sc7180_mc_virt},
 	{ .compatible = "qcom,sc7180-mmss-noc",

----8<----

[1] https://lore.kernel.org/r/154411934006.88331.2149751521264448532@swboyd.mtv.corp.google.com

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-11 19:06     ` Stephen Boyd
  2022-04-11 20:20       ` Stephen Boyd
@ 2022-04-12  2:18       ` Alex Elder
  2022-04-12  2:20       ` Alex Elder
  2 siblings, 0 replies; 10+ messages in thread
From: Alex Elder @ 2022-04-12  2:18 UTC (permalink / raw)
  To: Stephen Boyd, djakov, okukatla, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, Mike Tipton, mka, dianders

On 4/11/22 2:06 PM, Stephen Boyd wrote:
>> The second problem you have is exhibited by the IPA driver if
>> the "fix" commit (upstream b95b668eaaa2) is back-ported to the
>> Linux 5.10.y LTS branch (along with some other prerequisite
>> commits).  We can conclude that applying the above commit
>> makes the bandwidth for an unused interconnect (or perhaps
>> the rate for the IPA core clock) get set to zero.  And in that
>> case, an attempt to access IPA hardware leads to the crash you
>> observed.
>>
>> The IPA driver does not implement runtime power management
>> until Linux v5.15.  You later said you thought enabling that
>> might ensure the clock and interconnects were active when
>> needed by the IPA driver, and I concur (but there could be a
>> little more to it).
> Is the runtime PM patch series necessary to enable the IPA clk and
> interconnects? Things don't look good on 5.10.y and I'm not sure it will
> be workable. Commit b1d681d8d324 ("interconnect: Add sync state
> support") was introduced in v5.10 and that seems to be the commit that
> broke suspend on Lazor.
> 

This isn't a response to your complete message but I'm going
to respond to this part.

Before runtime PM was in place, the "IPA clock" (which was a
logical notion representing the IPA core clock and all the
interconnects it uses) was enabled before the IPA hardware
was first touched.  It was disabled again when system suspend
occurred, and re-enabled again on system resume.  At one time
we did observe the XO clock turning off.

I'm not sure that answers your question.  But bottom line
is that system suspend/resume were supported (and made
IPA clock+interconnects get shut off and then on again),
but not runtime PM.

					-Alex

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-11 19:06     ` Stephen Boyd
  2022-04-11 20:20       ` Stephen Boyd
  2022-04-12  2:18       ` Alex Elder
@ 2022-04-12  2:20       ` Alex Elder
  2 siblings, 0 replies; 10+ messages in thread
From: Alex Elder @ 2022-04-12  2:20 UTC (permalink / raw)
  To: Stephen Boyd, djakov, okukatla, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, Mike Tipton, mka, dianders

On 4/11/22 2:06 PM, Stephen Boyd wrote:
>> I have a hunch about what might be happening here.  There is
>> some synchronization that must occur between the AP and modem
>> when IPA is starting up.  Until that synchronization step has
>> completed, we can't allow the IPA network device to be opened.
> Is there a commit that implements this? Or how is the synchronization
> done? I can debug more and see if that synchronization is happening.

After testing today it seems maybe my hunch wasn't
the root cause.  If you disagree and I'm missing
something, say so.  I will take a look at that
though--my hunch--to see if the thing I thought
could be causing a problem can actually occur.
If so, I'll figure out a fix.

					-Alex


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate
  2022-04-11 20:20       ` Stephen Boyd
@ 2022-04-12  4:47         ` Stephen Boyd
  0 siblings, 0 replies; 10+ messages in thread
From: Stephen Boyd @ 2022-04-12  4:47 UTC (permalink / raw)
  To: Alex Elder, djakov, quic_mdtipton
  Cc: linux-arm-msm, linux-pm, linux-kernel, mka, dianders,
	Bjorn Andersson, Taniya Das, Sibi Sankar

Quoting Stephen Boyd (2022-04-11 13:20:27)
>
> I think it's some IPA unclocked access because it's an SError async
> abort. I looked some at the IPA clk implementation and it is a little
> concerning. I see two problems. The first problem is that clk-rpmh is
> only voting on the active state for the IPA clk. Maybe RPMh will move
> the sleep/wake state over from the active state automatically as long as
> we never make a request in either of those states? I don't know, but it
> looks concerning and either needs to be fixed or a big comment
> indicating that the copy happens. This patch makes it set a frequency in
> each state bucket.
>

This patch doesn't seem to matter. I think that's because RPMh copies
over active state settings to sleep and wake state automatically if
those states are never changed by that processor. Someone at qcom needs
to confirm this theory. I'll send the patch and see if that spurs
someone at qcom to respond.

>
> The second problem I see is that the IPA clk resource is "IP0" but it is
> still defined in the interconnect/qcom/sc7180.c file.
>
> $ git grep \"IP0\" -- drivers/{interconnect,clk}/
> drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdm845, ipa, "IP0");
> drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdx55, ipa, "IP0");
> drivers/interconnect/qcom/sc7180.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
> drivers/interconnect/qcom/sc8180x.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &slv_ipa_core_slave);
> drivers/interconnect/qcom/sdx55.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
> drivers/interconnect/qcom/sm8150.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
> drivers/interconnect/qcom/sm8250.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
>
> That sounds pretty bad because it means both interconnect and clk frameworks
> are trying to control the same RPMh resource, trampling over each other.
> Combine that with unused clks and sync_state support and I don't know
> what the state of the IP0 resource really is anymore.

Alright, some more debugging has confirmed this. I put a print in the
rpmh driver to figure out when the IP0 resource, according to cmd-db, is
being modified. Luckily, the interconnect driver uses the rpmh batch API
while the clk driver uses the direct write API. I put a dump_stack()
when the batch API is called on the IP0 address and that sometimes gets
zeroed out (i.e. IPA clk is turned off) after the rpmh clk driver turned
it ON. The rpmh driver in the kernel doesn't do any aggregation between
kernel clients, so the problem is this sequence:

	IPA driver probes
	runtime PM get()
	IPA clk enabled -> IP0 resource is ON
	interconnect zeroes out the IP0 resource -> IP0 resource is off
	IPA driver tries to access a register and blows up
	
The interconnect framework is zeroing it out now because there's a
sync_state callback once all drivers have probed. This is why I saw it
more easily when the IPA driver is builtin vs. a module. The IPA driver
is much more likely to have turned on the resource before sync_state is
called, but the use of autosuspend made it variable. If the IPA driver
autosuspend callback is delayed just enough, then IPA will turn on the
IP0 resource via the clk API, sync state will turn it off from the
interconnect framework, and then the pm_runtime_get_sync() in the IPA
driver will skip powering the clk on again because it never turned it
off. The runtime PM state of the device is correct, but the clk is off.
Oops!

I think changing the IPA_AUTOSUSPEND_DELAY define to be multiple seconds
makes it even more likely, because then the IPA device definitely won't
be suspended by the time interconnect sync state happens. Anyway, this
also explains why the IPA runtime PM patch made things better. Sometimes
the IPA device will be runtime suspended, and thus it will power on the
IP0 resource on runtime resume even after interconnect sync state
happened. This is the situation on v5.10.y, where runtime PM isn't
happening at all, but sync state is once commit b95b668eaaa2
("interconnect: qcom: icc-rpmh: Add BCMs to commit list in
pre_aggregate") is included. The IP0 resource is guaranteed to be turned
off after sync state drops the request.

So the fix is simple, completely remove IP0 from the interconnect driver
so that sync state doesn't turn it off. Then the IPA driver without
runtime PM will work because the clk resource is turned on and kept on
until system suspend. When the runtime PM patch in IPA is applied, it
will also work because runtime PM will power things on correctly, or
keep them powered on because autosuspend is delayed.

I'll send this patch to the list shortly. And split it up for the other
drivers too.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-04-12  4:47 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-25 17:47 [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate Georgi Djakov
2022-04-05 23:00 ` Stephen Boyd
2022-04-06  4:47   ` Stephen Boyd
2022-04-08 18:52     ` Stephen Boyd
2022-04-11 15:59   ` Alex Elder
2022-04-11 19:06     ` Stephen Boyd
2022-04-11 20:20       ` Stephen Boyd
2022-04-12  4:47         ` Stephen Boyd
2022-04-12  2:18       ` Alex Elder
2022-04-12  2:20       ` Alex Elder

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).