* [PATCH v4 0/4] soc/arm64: qcom: Add initial version of bwmon
@ 2022-06-01 10:11 ` Krzysztof Kozlowski
  0 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-01 10:11 UTC (permalink / raw)
  To: Andy Gross, Bjorn Andersson, Krzysztof Kozlowski, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel

Hi,

Changes since v3
================
1. Patch #2 (bwmon): remove unused irq_enable (kbuild robot);
   split bwmon_clear() into clearing counters and interrupts, so bwmon_start()
   does not clear the counters twice.

Changes since v2
================
1. Spent a lot of time on benchmarking and learning the BWMON behavior.
2. Drop PM/OPP patch - applied.
3. Patch #1: drop opp-avg-kBps.
4. Patch #2: Add several comments explaining pieces of code and BWMON, extend
   commit msg with measurements, extend help message, add new #defines to document
   some magic values, reorder bwmon clear/disable/enable operations to match
   downstream source and document this with comments, fix unit count from 1 MB
   to 64 kB.
5. Patch #4: drop opp-avg-kBps.
6. Add accumulated Rb tags.

Changes since v1
================
1. Add defconfig change.
2. Fix missing semicolon in MODULE_AUTHOR.
3. Add original downstream (msm-4.9 tree) copyrights to the driver.

Description
===========
BWMON is a data bandwidth monitor reporting throughput/bandwidth on certain
interconnect links in a SoC.  It can be used to gather current bus usage and
vote for interconnect bandwidth, thus adjusting the bus speed based on actual
usage.
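
In practice this ends up as a hardware-assisted feedback loop: BWMON raises an
interrupt when traffic crosses a programmed threshold, the handler translates
the measured peak into an interconnect vote and re-arms the thresholds around
the new level.  A rough, purely editorial sketch of that loop is below;
struct monitor, measured_peak_kbps() and reprogram_thresholds() are
illustrative placeholders, only the dev_pm_opp_*() calls match what the
actual driver in patch #2 uses:

  #include <linux/interrupt.h>
  #include <linux/pm_opp.h>

  struct monitor {
          struct device *dev;
          /* MMIO base, current thresholds, ... */
  };

  /* Illustrative helpers, not real kernel APIs */
  unsigned int measured_peak_kbps(struct monitor *mon);
  void reprogram_thresholds(struct monitor *mon, unsigned int kbps);

  static irqreturn_t threshold_irq_thread(int irq, void *data)
  {
          struct monitor *mon = data;
          unsigned int kbps = measured_peak_kbps(mon);  /* from HW counters */
          struct dev_pm_opp *opp;

          /* Pick the interconnect OPP at or above the measured demand... */
          opp = dev_pm_opp_find_bw_ceil(mon->dev, &kbps, 0);
          if (!IS_ERR(opp)) {
                  dev_pm_opp_set_opp(mon->dev, opp);    /* ...and vote for it */
                  dev_pm_opp_put(opp);
          }

          /* Re-arm the thresholds around the new level so that the next
           * change in either direction raises another interrupt.
           */
          reprogram_thresholds(mon, kbps);

          return IRQ_HANDLED;
  }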

The work is built on top of Thara Gopinath's patches with several cleanups,
changes and simplifications.

Best regards,
Krzysztof

Krzysztof Kozlowski (4):
  dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  soc: qcom: icc-bwmon: Add bandwidth monitoring driver
  arm64: defconfig: enable Qualcomm Bandwidth Monitor
  arm64: dts: qcom: sdm845: Add CPU BWMON

 .../interconnect/qcom,sdm845-cpu-bwmon.yaml   |  97 ++++
 MAINTAINERS                                   |   7 +
 arch/arm64/boot/dts/qcom/sdm845.dtsi          |  54 +++
 arch/arm64/configs/defconfig                  |   1 +
 drivers/soc/qcom/Kconfig                      |  15 +
 drivers/soc/qcom/Makefile                     |   1 +
 drivers/soc/qcom/icc-bwmon.c                  | 421 ++++++++++++++++++
 7 files changed, 596 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
 create mode 100644 drivers/soc/qcom/icc-bwmon.c

-- 
2.34.1


* [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  2022-06-01 10:11 ` Krzysztof Kozlowski
@ 2022-06-01 10:11   ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-01 10:11 UTC (permalink / raw)
  To: Andy Gross, Bjorn Andersson, Krzysztof Kozlowski, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel
  Cc: Rob Herring

Add bindings for the Qualcomm Bandwidth Monitor device providing
performance data on interconnects.  The bindings describe only BWMON
version 4, e.g. the instance on SDM845 between CPU and Last Level Cache
Controller.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Acked-by: Georgi Djakov <djakov@kernel.org>
---
 .../interconnect/qcom,sdm845-cpu-bwmon.yaml   | 97 +++++++++++++++++++
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml

diff --git a/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
new file mode 100644
index 000000000000..8c82e06ee432
--- /dev/null
+++ b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
@@ -0,0 +1,97 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/interconnect/qcom,sdm845-cpu-bwmon.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Interconnect Bandwidth Monitor
+
+maintainers:
+  - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
+
+description:
+  Bandwidth Monitor measures current throughput on buses between various NoC
+  fabrics and provides information when it crosses configured thresholds.
+
+properties:
+  compatible:
+    enum:
+      - qcom,sdm845-cpu-bwmon       # BWMON v4
+
+  interconnects:
+    maxItems: 2
+
+  interconnect-names:
+    items:
+      - const: ddr
+      - const: l3c
+
+  interrupts:
+    maxItems: 1
+
+  operating-points-v2: true
+  opp-table: true
+
+  reg:
+    # Currently described BWMON v4 and v5 use one register address space.
+    # BWMON v2 uses two register spaces - not yet described.
+    maxItems: 1
+
+required:
+  - compatible
+  - interconnects
+  - interconnect-names
+  - interrupts
+  - operating-points-v2
+  - opp-table
+  - reg
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interconnect/qcom,osm-l3.h>
+    #include <dt-bindings/interconnect/qcom,sdm845.h>
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+    pmu@1436400 {
+        compatible = "qcom,sdm845-cpu-bwmon";
+        reg = <0x01436400 0x600>;
+
+        interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
+
+        interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
+                        <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
+        interconnect-names = "ddr", "l3c";
+
+        operating-points-v2 = <&cpu_bwmon_opp_table>;
+
+        cpu_bwmon_opp_table: opp-table {
+            compatible = "operating-points-v2";
+
+            opp-0 {
+                opp-peak-kBps = <800000 4800000>;
+            };
+            opp-1 {
+                opp-peak-kBps = <1804000 9216000>;
+            };
+            opp-2 {
+                opp-peak-kBps = <2188000 11980800>;
+            };
+            opp-3 {
+                opp-peak-kBps = <3072000 15052800>;
+            };
+            opp-4 {
+                opp-peak-kBps = <4068000 19353600>;
+            };
+            opp-5 {
+                opp-peak-kBps = <5412000 20889600>;
+            };
+            opp-6 {
+                opp-peak-kBps = <6220000 22425600>;
+            };
+            opp-7 {
+                opp-peak-kBps = <7216000 25497600>;
+            };
+        };
+    };
-- 
2.34.1


* [PATCH v4 2/4] soc: qcom: icc-bwmon: Add bandwidth monitoring driver
  2022-06-01 10:11 ` Krzysztof Kozlowski
@ 2022-06-01 10:11   ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-01 10:11 UTC (permalink / raw)
  To: Andy Gross, Bjorn Andersson, Krzysztof Kozlowski, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath

Bandwidth monitoring (BWMON) sits between various subsystems like the CPU,
GPU, Last Level Cache and the memory subsystem.  The BWMON can be
configured to monitor the data throughput between memory and the other
subsystems.  The throughput is measured within a specified sampling window
and is used to vote for the corresponding interconnect bandwidth.

The current implementation brings support for BWMON v4, used for example on
SDM845 to measure bandwidth between the CPU (gladiator_noc) and the Last
Level Cache (memnoc).  Using this BWMON allows removing the fixed bandwidth
votes from cpufreq (CPU nodes) and thus achieving high memory throughput
even with lower CPU frequencies.
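
To give a sense of the numbers involved (an illustrative reading, using the
64 kB BWMONv4 count unit and the 4 ms sample window configured for SDM845
below): a zone-max register value of 149 count units corresponds to
(149 + 1) * 64 kB = 9600 kB transferred within 4 ms, i.e.
9600 * 1000 / 4 = 2400000 kBps, which the driver then rounds up to the
nearest interconnect OPP (3072000 kBps in the SDM845 table from patch #4)
when voting.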

Co-developed-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
---
 MAINTAINERS                  |   7 +
 drivers/soc/qcom/Kconfig     |  15 ++
 drivers/soc/qcom/Makefile    |   1 +
 drivers/soc/qcom/icc-bwmon.c | 421 +++++++++++++++++++++++++++++++++++
 4 files changed, 444 insertions(+)
 create mode 100644 drivers/soc/qcom/icc-bwmon.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 6157e706ed02..bc123f706256 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16376,6 +16376,13 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/i2c/i2c-qcom-cci.txt
 F:	drivers/i2c/busses/i2c-qcom-cci.c
 
+QUALCOMM INTERCONNECT BWMON DRIVER
+M:	Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
+L:	linux-arm-msm@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
+F:	drivers/soc/qcom/icc-bwmon.c
+
 QUALCOMM IOMMU
 M:	Rob Clark <robdclark@gmail.com>
 L:	iommu@lists.linux-foundation.org
diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
index e718b8735444..35c5192dcfc7 100644
--- a/drivers/soc/qcom/Kconfig
+++ b/drivers/soc/qcom/Kconfig
@@ -228,4 +228,19 @@ config QCOM_APR
 	  application processor and QDSP6. APR is
 	  used by audio driver to configure QDSP6
 	  ASM, ADM and AFE modules.
+
+config QCOM_ICC_BWMON
+	tristate "QCOM Interconnect Bandwidth Monitor driver"
+	depends on ARCH_QCOM || COMPILE_TEST
+	select PM_OPP
+	help
+	  Sets up a driver that monitors bandwidth on various interconnects
+	  and, based on the measurements, votes for interconnect bandwidth,
+	  adjusting their speed to the current demand.
+	  The current implementation brings support for BWMON v4, used for
+	  example on SDM845 to measure bandwidth between the CPU
+	  (gladiator_noc) and the Last Level Cache (memnoc).  Using this BWMON
+	  allows removing fixed bandwidth votes from cpufreq (CPU nodes) and
+	  thus achieves high memory throughput even with lower CPU frequencies.
+
 endmenu
diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
index 70d5de69fd7b..d66604aff2b0 100644
--- a/drivers/soc/qcom/Makefile
+++ b/drivers/soc/qcom/Makefile
@@ -28,3 +28,4 @@ obj-$(CONFIG_QCOM_LLCC) += llcc-qcom.o
 obj-$(CONFIG_QCOM_RPMHPD) += rpmhpd.o
 obj-$(CONFIG_QCOM_RPMPD) += rpmpd.o
 obj-$(CONFIG_QCOM_KRYO_L2_ACCESSORS) +=	kryo-l2-accessors.o
+obj-$(CONFIG_QCOM_ICC_BWMON)	+= icc-bwmon.o
diff --git a/drivers/soc/qcom/icc-bwmon.c b/drivers/soc/qcom/icc-bwmon.c
new file mode 100644
index 000000000000..1eed075545db
--- /dev/null
+++ b/drivers/soc/qcom/icc-bwmon.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2014-2018, The Linux Foundation. All rights reserved.
+ * Copyright (C) 2021-2022 Linaro Ltd
+ * Author: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>, based on
+ *         previous work of Thara Gopinath and msm-4.9 downstream sources.
+ */
+#include <linux/interconnect.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/platform_device.h>
+#include <linux/pm_opp.h>
+#include <linux/sizes.h>
+
+/*
+ * The BWMON samples data throughput within a 'sample_ms' time window.
+ * Together with the three configurable thresholds (Low, Medium and High)
+ * this gives four windows (called zones) of current bandwidth:
+ *
+ * Zone 0: byte count < THRES_LO
+ * Zone 1: THRES_LO < byte count < THRES_MED
+ * Zone 2: THRES_MED < byte count < THRES_HIGH
+ * Zone 3: THRES_HIGH < byte count
+ *
+ * Zones 0 and 2 are not used by this driver.
+ */
+
+/* Internal sampling clock frequency */
+#define HW_TIMER_HZ				19200000
+
+#define BWMON_GLOBAL_IRQ_STATUS			0x0
+#define BWMON_GLOBAL_IRQ_CLEAR			0x8
+#define BWMON_GLOBAL_IRQ_ENABLE			0xc
+#define BWMON_GLOBAL_IRQ_ENABLE_ENABLE		BIT(0)
+
+#define BWMON_IRQ_STATUS			0x100
+#define BWMON_IRQ_STATUS_ZONE_SHIFT		4
+#define BWMON_IRQ_CLEAR				0x108
+#define BWMON_IRQ_ENABLE			0x10c
+#define BWMON_IRQ_ENABLE_ZONE1_SHIFT		5
+#define BWMON_IRQ_ENABLE_ZONE2_SHIFT		6
+#define BWMON_IRQ_ENABLE_ZONE3_SHIFT		7
+#define BWMON_IRQ_ENABLE_MASK			(BIT(BWMON_IRQ_ENABLE_ZONE1_SHIFT) | \
+						 BIT(BWMON_IRQ_ENABLE_ZONE3_SHIFT))
+
+#define BWMON_ENABLE				0x2a0
+#define BWMON_ENABLE_ENABLE			BIT(0)
+
+#define BWMON_CLEAR				0x2a4
+#define BWMON_CLEAR_CLEAR			BIT(0)
+
+#define BWMON_SAMPLE_WINDOW			0x2a8
+#define BWMON_THRESHOLD_HIGH			0x2ac
+#define BWMON_THRESHOLD_MED			0x2b0
+#define BWMON_THRESHOLD_LOW			0x2b4
+
+#define BWMON_ZONE_ACTIONS			0x2b8
+/*
+ * Actions to perform on some zone 'z' when current zone hits the threshold:
+ * Increment counter of zone 'z'
+ */
+#define BWMON_ZONE_ACTIONS_INCREMENT(z)		(0x2 << ((z) * 2))
+/* Clear counter of zone 'z' */
+#define BWMON_ZONE_ACTIONS_CLEAR(z)		(0x1 << ((z) * 2))
+
+/* Zone 0 threshold hit: Clear zone count */
+#define BWMON_ZONE_ACTIONS_ZONE0		(BWMON_ZONE_ACTIONS_CLEAR(0))
+
+/* Zone 1 threshold hit: Increment zone count & clear lower zones */
+#define BWMON_ZONE_ACTIONS_ZONE1		(BWMON_ZONE_ACTIONS_INCREMENT(1) | \
+						 BWMON_ZONE_ACTIONS_CLEAR(0))
+
+/* Zone 2 threshold hit: Increment zone count & clear lower zones */
+#define BWMON_ZONE_ACTIONS_ZONE2		(BWMON_ZONE_ACTIONS_INCREMENT(2) | \
+						 BWMON_ZONE_ACTIONS_CLEAR(1) | \
+						 BWMON_ZONE_ACTIONS_CLEAR(0))
+
+/* Zone 3 threshold hit: Increment zone count & clear lower zones */
+#define BWMON_ZONE_ACTIONS_ZONE3		(BWMON_ZONE_ACTIONS_INCREMENT(3) | \
+						 BWMON_ZONE_ACTIONS_CLEAR(2) | \
+						 BWMON_ZONE_ACTIONS_CLEAR(1) | \
+						 BWMON_ZONE_ACTIONS_CLEAR(0))
+/* Value for BWMON_ZONE_ACTIONS */
+#define BWMON_ZONE_ACTIONS_DEFAULT		(BWMON_ZONE_ACTIONS_ZONE0 | \
+						 BWMON_ZONE_ACTIONS_ZONE1 << 8 | \
+						 BWMON_ZONE_ACTIONS_ZONE2 << 16 | \
+						 BWMON_ZONE_ACTIONS_ZONE3 << 24)
+
+/*
+ * There is no clear documentation/explanation of the BWMON_THRESHOLD_COUNT
+ * register. Based on observations, this is the number of times a threshold
+ * has to be reached to trigger an interrupt in a given zone.
+ *
+ * 0xff is the maximum value, used here to effectively ignore zones 0 and 2.
+ */
+#define BWMON_THRESHOLD_COUNT			0x2bc
+#define BWMON_THRESHOLD_COUNT_ZONE1_SHIFT	8
+#define BWMON_THRESHOLD_COUNT_ZONE2_SHIFT	16
+#define BWMON_THRESHOLD_COUNT_ZONE3_SHIFT	24
+#define BWMON_THRESHOLD_COUNT_ZONE0_DEFAULT	0xff
+#define BWMON_THRESHOLD_COUNT_ZONE2_DEFAULT	0xff
+
+/* BWMONv4 count registers use count unit of 64 kB */
+#define BWMON_COUNT_UNIT_KB			64
+#define BWMON_ZONE_COUNT			0x2d8
+#define BWMON_ZONE_MAX(zone)			(0x2e0 + 4 * (zone))
+
+struct icc_bwmon_data {
+	unsigned int sample_ms;
+	unsigned int default_highbw_kbps;
+	unsigned int default_medbw_kbps;
+	unsigned int default_lowbw_kbps;
+	u8 zone1_thres_count;
+	u8 zone3_thres_count;
+};
+
+struct icc_bwmon {
+	struct device *dev;
+	void __iomem *base;
+	int irq;
+
+	unsigned int default_lowbw_kbps;
+	unsigned int sample_ms;
+	unsigned int max_bw_kbps;
+	unsigned int min_bw_kbps;
+	unsigned int target_kbps;
+	unsigned int current_kbps;
+};
+
+static void bwmon_clear_counters(struct icc_bwmon *bwmon)
+{
+	/*
+	 * Clear counters. The order and barriers are
+	 * important. Quoting downstream Qualcomm msm-4.9 tree:
+	 *
+	 * The counter clear and IRQ clear bits are not in the same 4KB
+	 * region. So, we need to make sure the counter clear is completed
+	 * before we try to clear the IRQ or do any other counter operations.
+	 */
+	writel(BWMON_CLEAR_CLEAR, bwmon->base + BWMON_CLEAR);
+}
+
+static void bwmon_clear_irq(struct icc_bwmon *bwmon)
+{
+	/*
+	 * Clear zone and global interrupts. The order and barriers are
+	 * important. Quoting downstream Qualcomm msm-4.9 tree:
+	 *
+	 * Synchronize the local interrupt clear in mon_irq_clear()
+	 * with the global interrupt clear here. Otherwise, the CPU
+	 * may reorder the two writes and clear the global interrupt
+	 * before the local interrupt, causing the global interrupt
+	 * to be retriggered by the local interrupt still being high.
+	 *
+	 * Similarly, because the global registers are in a different
+	 * region than the local registers, we need to ensure any register
+	 * writes to enable the monitor after this call are ordered with the
+	 * clearing here so that local writes don't happen before the
+	 * interrupt is cleared.
+	 */
+	writel(BWMON_IRQ_ENABLE_MASK, bwmon->base + BWMON_IRQ_CLEAR);
+	writel(BIT(0), bwmon->base + BWMON_GLOBAL_IRQ_CLEAR);
+}
+
+static void bwmon_disable(struct icc_bwmon *bwmon)
+{
+	/* Disable interrupts. Strict ordering, see bwmon_clear_irq(). */
+	writel(0x0, bwmon->base + BWMON_GLOBAL_IRQ_ENABLE);
+	writel(0x0, bwmon->base + BWMON_IRQ_ENABLE);
+
+	/*
+	 * Disable bwmon. Must happen before bwmon_clear_irq() to avoid spurious
+	 * IRQ.
+	 */
+	writel(0x0, bwmon->base + BWMON_ENABLE);
+}
+
+static void bwmon_enable(struct icc_bwmon *bwmon, unsigned int irq_enable)
+{
+	/* Enable interrupts */
+	writel(BWMON_GLOBAL_IRQ_ENABLE_ENABLE,
+	       bwmon->base + BWMON_GLOBAL_IRQ_ENABLE);
+	writel(irq_enable, bwmon->base + BWMON_IRQ_ENABLE);
+
+	/* Enable bwmon */
+	writel(BWMON_ENABLE_ENABLE, bwmon->base + BWMON_ENABLE);
+}
+
+static unsigned int bwmon_kbps_to_count(unsigned int kbps)
+{
+	return kbps / BWMON_COUNT_UNIT_KB;
+}
+
+static void bwmon_set_threshold(struct icc_bwmon *bwmon, unsigned int reg,
+				unsigned int kbps)
+{
+	unsigned int thres;
+
+	thres = mult_frac(bwmon_kbps_to_count(kbps), bwmon->sample_ms,
+			  MSEC_PER_SEC);
+	writel_relaxed(thres, bwmon->base + reg);
+}
+
+static void bwmon_start(struct icc_bwmon *bwmon,
+			const struct icc_bwmon_data *data)
+{
+	unsigned int thres_count;
+	int window;
+
+	bwmon_clear_counters(bwmon);
+
+	window = mult_frac(bwmon->sample_ms, HW_TIMER_HZ, MSEC_PER_SEC);
+	/* Maximum sampling window: 0xfffff */
+	writel_relaxed(window, bwmon->base + BWMON_SAMPLE_WINDOW);
+
+	bwmon_set_threshold(bwmon, BWMON_THRESHOLD_HIGH,
+			    data->default_highbw_kbps);
+	bwmon_set_threshold(bwmon, BWMON_THRESHOLD_MED,
+			    data->default_medbw_kbps);
+	bwmon_set_threshold(bwmon, BWMON_THRESHOLD_LOW,
+			    data->default_lowbw_kbps);
+
+	thres_count = data->zone3_thres_count << BWMON_THRESHOLD_COUNT_ZONE3_SHIFT |
+		      BWMON_THRESHOLD_COUNT_ZONE2_DEFAULT << BWMON_THRESHOLD_COUNT_ZONE2_SHIFT |
+		      data->zone1_thres_count << BWMON_THRESHOLD_COUNT_ZONE1_SHIFT |
+		      BWMON_THRESHOLD_COUNT_ZONE0_DEFAULT;
+	writel_relaxed(thres_count, bwmon->base + BWMON_THRESHOLD_COUNT);
+	writel_relaxed(BWMON_ZONE_ACTIONS_DEFAULT,
+		       bwmon->base + BWMON_ZONE_ACTIONS);
+	/* Write barriers in bwmon_clear_irq() */
+
+	bwmon_clear_irq(bwmon);
+	bwmon_enable(bwmon, BWMON_IRQ_ENABLE_MASK);
+}
+
+static irqreturn_t bwmon_intr(int irq, void *dev_id)
+{
+	struct icc_bwmon *bwmon = dev_id;
+	unsigned int status, max;
+	int zone;
+
+	status = readl(bwmon->base + BWMON_IRQ_STATUS);
+	status &= BWMON_IRQ_ENABLE_MASK;
+	if (!status) {
+		/*
+		 * Only zone 1 and zone 3 interrupts are enabled, but the
+		 * zone 2 threshold could still be hit and trigger an interrupt
+		 * even though it is not enabled.
+		 * Such a spurious interrupt may or may not come with a valuable
+		 * max count, so a solution would be to always check all
+		 * BWMON_ZONE_MAX() registers to find the highest value.
+		 * Such a case is currently ignored.
+		 */
+		return IRQ_NONE;
+	}
+
+	bwmon_disable(bwmon);
+
+	zone = get_bitmask_order(status >> BWMON_IRQ_STATUS_ZONE_SHIFT) - 1;
+	/*
+	 * Zone max bytes count register returns count units within sampling
+	 * window.  Downstream kernel for BWMONv4 (called BWMON type 2 in
+	 * downstream) always increments the max bytes count by one.
+	 */
+	max = readl(bwmon->base + BWMON_ZONE_MAX(zone)) + 1;
+	max *= BWMON_COUNT_UNIT_KB;
+	bwmon->target_kbps = mult_frac(max, MSEC_PER_SEC, bwmon->sample_ms);
+
+	return IRQ_WAKE_THREAD;
+}
+
+static irqreturn_t bwmon_intr_thread(int irq, void *dev_id)
+{
+	struct icc_bwmon *bwmon = dev_id;
+	unsigned int irq_enable = 0;
+	struct dev_pm_opp *opp, *target_opp;
+	unsigned int bw_kbps, up_kbps, down_kbps;
+
+	bw_kbps = bwmon->target_kbps;
+
+	target_opp = dev_pm_opp_find_bw_ceil(bwmon->dev, &bw_kbps, 0);
+	if (IS_ERR(target_opp) && PTR_ERR(target_opp) == -ERANGE)
+		target_opp = dev_pm_opp_find_bw_floor(bwmon->dev, &bw_kbps, 0);
+
+	bwmon->target_kbps = bw_kbps;
+
+	bw_kbps--;
+	opp = dev_pm_opp_find_bw_floor(bwmon->dev, &bw_kbps, 0);
+	if (IS_ERR(opp) && PTR_ERR(opp) == -ERANGE)
+		down_kbps = bwmon->target_kbps;
+	else
+		down_kbps = bw_kbps;
+
+	up_kbps = bwmon->target_kbps + 1;
+
+	if (bwmon->target_kbps >= bwmon->max_bw_kbps)
+		irq_enable = BIT(BWMON_IRQ_ENABLE_ZONE1_SHIFT);
+	else if (bwmon->target_kbps <= bwmon->min_bw_kbps)
+		irq_enable = BIT(BWMON_IRQ_ENABLE_ZONE3_SHIFT);
+	else
+		irq_enable = BWMON_IRQ_ENABLE_MASK;
+
+	bwmon_set_threshold(bwmon, BWMON_THRESHOLD_HIGH, up_kbps);
+	bwmon_set_threshold(bwmon, BWMON_THRESHOLD_MED, down_kbps);
+	/* Write barriers in bwmon_clear_counters() */
+	bwmon_clear_counters(bwmon);
+	bwmon_clear_irq(bwmon);
+	bwmon_enable(bwmon, irq_enable);
+
+	if (bwmon->target_kbps == bwmon->current_kbps)
+		goto out;
+
+	dev_pm_opp_set_opp(bwmon->dev, target_opp);
+	bwmon->current_kbps = bwmon->target_kbps;
+
+out:
+	dev_pm_opp_put(target_opp);
+	if (!IS_ERR(opp))
+		dev_pm_opp_put(opp);
+
+	return IRQ_HANDLED;
+}
+
+static int bwmon_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct dev_pm_opp *opp;
+	struct icc_bwmon *bwmon;
+	const struct icc_bwmon_data *data;
+	int ret;
+
+	bwmon = devm_kzalloc(dev, sizeof(*bwmon), GFP_KERNEL);
+	if (!bwmon)
+		return -ENOMEM;
+
+	data = of_device_get_match_data(dev);
+
+	bwmon->base = devm_platform_ioremap_resource(pdev, 0);
+	if (IS_ERR(bwmon->base)) {
+		dev_err(dev, "failed to map bwmon registers\n");
+		return PTR_ERR(bwmon->base);
+	}
+
+	bwmon->irq = platform_get_irq(pdev, 0);
+	if (bwmon->irq < 0) {
+		dev_err(dev, "failed to acquire bwmon IRQ\n");
+		return bwmon->irq;
+	}
+
+	ret = devm_pm_opp_of_add_table(dev);
+	if (ret)
+		return dev_err_probe(dev, ret, "failed to add OPP table\n");
+
+	bwmon->max_bw_kbps = UINT_MAX;
+	opp = dev_pm_opp_find_bw_floor(dev, &bwmon->max_bw_kbps, 0);
+	if (IS_ERR(opp))
+		return dev_err_probe(dev, PTR_ERR(opp), "failed to find max peak bandwidth\n");
+
+	bwmon->min_bw_kbps = 0;
+	opp = dev_pm_opp_find_bw_ceil(dev, &bwmon->min_bw_kbps, 0);
+	if (IS_ERR(opp))
+		return dev_err_probe(dev, PTR_ERR(opp), "failed to find min peak bandwidth\n");
+
+	bwmon->sample_ms = data->sample_ms;
+	bwmon->default_lowbw_kbps = data->default_lowbw_kbps;
+	bwmon->dev = dev;
+
+	bwmon_disable(bwmon);
+	ret = devm_request_threaded_irq(dev, bwmon->irq, bwmon_intr,
+					bwmon_intr_thread,
+					IRQF_ONESHOT, dev_name(dev), bwmon);
+	if (ret)
+		return dev_err_probe(dev, ret, "failed to request IRQ\n");
+
+	platform_set_drvdata(pdev, bwmon);
+	bwmon_start(bwmon, data);
+
+	return 0;
+}
+
+static int bwmon_remove(struct platform_device *pdev)
+{
+	struct icc_bwmon *bwmon = platform_get_drvdata(pdev);
+
+	bwmon_disable(bwmon);
+
+	return 0;
+}
+
+/* BWMON v4 */
+static const struct icc_bwmon_data sdm845_bwmon_data = {
+	.sample_ms = 4,
+	.default_highbw_kbps = 4800 * 1024, /* 4.8 GBps */
+	.default_medbw_kbps = 512 * 1024, /* 512 MBps */
+	.default_lowbw_kbps = 0,
+	.zone1_thres_count = 16,
+	.zone3_thres_count = 1,
+};
+
+static const struct of_device_id bwmon_of_match[] = {
+	{ .compatible = "qcom,sdm845-cpu-bwmon", .data = &sdm845_bwmon_data },
+	{}
+};
+MODULE_DEVICE_TABLE(of, bwmon_of_match);
+
+static struct platform_driver bwmon_driver = {
+	.probe = bwmon_probe,
+	.remove = bwmon_remove,
+	.driver = {
+		.name = "qcom-bwmon",
+		.of_match_table = bwmon_of_match,
+	},
+};
+module_platform_driver(bwmon_driver);
+
+MODULE_AUTHOR("Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>");
+MODULE_DESCRIPTION("QCOM BWMON driver");
+MODULE_LICENSE("GPL");
-- 
2.34.1


* [PATCH v4 3/4] arm64: defconfig: enable Qualcomm Bandwidth Monitor
  2022-06-01 10:11 ` Krzysztof Kozlowski
@ 2022-06-01 10:11   ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-01 10:11 UTC (permalink / raw)
  To: Andy Gross, Bjorn Andersson, Krzysztof Kozlowski, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel

Enable the Qualcomm Bandwidth Monitor to allow scaling interconnects
depending on the bandwidth usage between the CPU and memory.  This is
already used on the Qualcomm SDM845 SoC.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
---
 arch/arm64/configs/defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 6906b83f5e45..6edbcfd3f4ca 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -1096,6 +1096,7 @@ CONFIG_QCOM_SOCINFO=m
 CONFIG_QCOM_STATS=m
 CONFIG_QCOM_WCNSS_CTRL=m
 CONFIG_QCOM_APR=m
+CONFIG_QCOM_ICC_BWMON=m
 CONFIG_ARCH_R8A77995=y
 CONFIG_ARCH_R8A77990=y
 CONFIG_ARCH_R8A77950=y
-- 
2.34.1


* [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-01 10:11 ` Krzysztof Kozlowski
@ 2022-06-01 10:11   ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-01 10:11 UTC (permalink / raw)
  To: Andy Gross, Bjorn Andersson, Krzysztof Kozlowski, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath

Add a device node for the CPU-memory BWMON device (bandwidth monitoring)
on SDM845, measuring bandwidth between the CPU (gladiator_noc) and the
Last Level Cache (memnoc).  Using this BWMON allows removing the fixed
bandwidth votes from cpufreq (CPU nodes) and thus achieving high memory
throughput even with lower CPU frequencies.

Performance impact (SDM845-MTP RB3 board, linux next-20220422):
1. No noticeable impact when running with schedutil or performance
   governors.

2. When compared to a customized kernel with synced interconnects and
   without bandwidth votes from CPU freq, the sysbench memory tests
   show significant improvement with bwmon for block sizes past the L3
   cache.  The results of such a superficial comparison:

sysbench memory test, results in MB/s (higher is better)
 bs kB |  type |    V  | V+no bw votes | bwmon | benefit %
     1 | W/seq | 14795 |          4816 |  4985 |      3.5%
    64 | W/seq | 41987 |         10334 | 10433 |      1.0%
  4096 | W/seq | 29768 |          8728 | 32007 |    266.7%
 65536 | W/seq | 17711 |          4846 | 18399 |    279.6%
262144 | W/seq | 16112 |          4538 | 17429 |    284.1%
    64 | R/seq | 61202 |         67092 | 66804 |     -0.4%
  4096 | R/seq | 23871 |          5458 | 24307 |    345.4%
 65536 | R/seq | 18554 |          4240 | 18685 |    340.7%
262144 | R/seq | 17524 |          4207 | 17774 |    322.4%
    64 | W/rnd |  2663 |          1098 |  1119 |      1.9%
 65536 | W/rnd |   600 |           316 |   610 |     92.7%
    64 | R/rnd |  4915 |          4784 |  4594 |     -4.0%
 65536 | R/rnd |   664 |           281 |   678 |    140.7%

Legend:
bs kB: block size in kB (a small block size means only the L1-L3 caches
       are used)
type: R - read, W - write, seq - sequential, rnd - random
V: vanilla (next-20220422)
V + no bw votes: vanilla without bandwidth votes from CPU freq
bwmon: bwmon without bandwidth votes from CPU freq
benefit %: difference between vanilla without bandwidth votes and bwmon
           (higher is better)
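
For example, taking the 4096 kB sequential write row: the benefit is computed
against the kernel without bandwidth votes, i.e.
(32007 - 8728) / 8728 * 100 ≈ 266.7%, not against vanilla.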

Co-developed-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
---
 arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
index 83e8b63f0910..adffb9c70566 100644
--- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
@@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
 			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
 		};
 
+		pmu@1436400 {
+			compatible = "qcom,sdm845-cpu-bwmon";
+			reg = <0 0x01436400 0 0x600>;
+
+			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
+
+			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
+					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
+			interconnect-names = "ddr", "l3c";
+
+			operating-points-v2 = <&cpu_bwmon_opp_table>;
+
+			cpu_bwmon_opp_table: opp-table {
+				compatible = "operating-points-v2";
+
+				/*
+				 * The interconnect path bandwidths are taken
+				 * from the cpu4_opp_table bandwidths.
+				 * They also match different tables from
+				 * msm-4.9 downstream kernel:
+				 *  - the gladiator_noc-mem_noc from bandwidth
+				 *    table of qcom,llccbw (property qcom,bw-tbl);
+				 *    bus width: 4 bytes;
+				 *  - the OSM L3 from bandwidth table of
+				 *    qcom,cpu4-l3lat-mon (qcom,core-dev-table);
+				 *    bus width: 16 bytes;
+				 */
+				opp-0 {
+					opp-peak-kBps = <800000 4800000>;
+				};
+				opp-1 {
+					opp-peak-kBps = <1804000 9216000>;
+				};
+				opp-2 {
+					opp-peak-kBps = <2188000 11980800>;
+				};
+				opp-3 {
+					opp-peak-kBps = <3072000 15052800>;
+				};
+				opp-4 {
+					opp-peak-kBps = <4068000 19353600>;
+				};
+				opp-5 {
+					opp-peak-kBps = <5412000 20889600>;
+				};
+				opp-6 {
+					opp-peak-kBps = <6220000 22425600>;
+				};
+				opp-7 {
+					opp-peak-kBps = <7216000 25497600>;
+				};
+			};
+		};
+
 		pcie0: pci@1c00000 {
 			compatible = "qcom,pcie-sdm845";
 			reg = <0 0x01c00000 0 0x2000>,
-- 
2.34.1


* Re: [PATCH v4 2/4] soc: qcom: icc-bwmon: Add bandwidth monitoring driver
  2022-06-01 10:11   ` Krzysztof Kozlowski
@ 2022-06-06 16:35     ` Georgi Djakov
  -1 siblings, 0 replies; 52+ messages in thread
From: Georgi Djakov @ 2022-06-06 16:35 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Andy Gross, Bjorn Andersson, Rob Herring,
	Catalin Marinas, Will Deacon, linux-arm-msm, linux-pm,
	devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath

Hi Krzysztof,

Thanks for working on this!

On 1.06.22 13:11, Krzysztof Kozlowski wrote:
> Bandwidth monitoring (BWMON) sits between various subsystems like CPU,
> GPU, Last Level caches and the memory subsystem.  The BWMON can be
> configured to monitor the data throughput between memory and other
> subsystems.  The throughput is measured within a specified sampling
> window and is used to vote for the corresponding interconnect bandwidth.
> 
> The current implementation brings support for BWMON v4, used for example
> on SDM845 to measure bandwidth between the CPU (gladiator_noc) and Last
> Level Cache (memnoc).  Usage of this BWMON allows removing the fixed
> bandwidth votes from cpufreq (CPU nodes) and thus achieving high memory
> throughput even with lower CPU frequencies.

I am curious if you ran any tests - e.g. set the CPU to some fixed
frequency and run memory throughput benchmarks with/without this
driver? Could you share any data?

> Co-developed-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> ---
>   MAINTAINERS                  |   7 +
>   drivers/soc/qcom/Kconfig     |  15 ++
>   drivers/soc/qcom/Makefile    |   1 +
>   drivers/soc/qcom/icc-bwmon.c | 421 +++++++++++++++++++++++++++++++++++
>   4 files changed, 444 insertions(+)
>   create mode 100644 drivers/soc/qcom/icc-bwmon.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6157e706ed02..bc123f706256 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16376,6 +16376,13 @@ S:	Maintained
>   F:	Documentation/devicetree/bindings/i2c/i2c-qcom-cci.txt
>   F:	drivers/i2c/busses/i2c-qcom-cci.c
>   
> +QUALCOMM INTERCONNECT BWMON DRIVER
> +M:	Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> +L:	linux-arm-msm@vger.kernel.org
> +S:	Maintained
> +F:	Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
> +F:	drivers/soc/qcom/icc-bwmon.c
> +
>   QUALCOMM IOMMU
>   M:	Rob Clark <robdclark@gmail.com>
>   L:	iommu@lists.linux-foundation.org
> diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
> index e718b8735444..35c5192dcfc7 100644
> --- a/drivers/soc/qcom/Kconfig
> +++ b/drivers/soc/qcom/Kconfig
> @@ -228,4 +228,19 @@ config QCOM_APR
>   	  application processor and QDSP6. APR is
>   	  used by audio driver to configure QDSP6
>   	  ASM, ADM and AFE modules.
> +
> +config QCOM_ICC_BWMON
> +	tristate "QCOM Interconnect Bandwidth Monitor driver"
> +	depends on ARCH_QCOM || COMPILE_TEST
> +	select PM_OPP
> +	help
> +	  Sets up driver monitoring bandwidth on various interconnects and
> +	  based on that voting for interconnect bandwidth, adjusting their
> +	  speed to current demand.
> +	  Current implementation brings support for BWMON v4, used for example
> +	  on SDM845 to measure bandwidth between CPU (gladiator_noc) and Last
> +	  Level Cache (memnoc).  Usage of this BWMON allows to remove fixed
> +	  bandwidth votes from cpufreq (CPU nodes) thus achieve high memory
> +	  throughput even with lower CPU frequencies.
> +
>   endmenu
> diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
> index 70d5de69fd7b..d66604aff2b0 100644
> --- a/drivers/soc/qcom/Makefile
> +++ b/drivers/soc/qcom/Makefile
> @@ -28,3 +28,4 @@ obj-$(CONFIG_QCOM_LLCC) += llcc-qcom.o
>   obj-$(CONFIG_QCOM_RPMHPD) += rpmhpd.o
>   obj-$(CONFIG_QCOM_RPMPD) += rpmpd.o
>   obj-$(CONFIG_QCOM_KRYO_L2_ACCESSORS) +=	kryo-l2-accessors.o
> +obj-$(CONFIG_QCOM_ICC_BWMON)	+= icc-bwmon.o
> diff --git a/drivers/soc/qcom/icc-bwmon.c b/drivers/soc/qcom/icc-bwmon.c
> new file mode 100644
> index 000000000000..1eed075545db
> --- /dev/null
> +++ b/drivers/soc/qcom/icc-bwmon.c
> @@ -0,0 +1,421 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2014-2018, The Linux Foundation. All rights reserved.
> + * Copyright (C) 2021-2022 Linaro Ltd
> + * Author: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>, based on
> + *         previous work of Thara Gopinath and msm-4.9 downstream sources.
> + */
> +#include <linux/interconnect.h>

Is this used?

> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/platform_device.h>
> +#include <linux/pm_opp.h>
> +#include <linux/sizes.h>

Ditto.

Thanks,
Georgi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-01 10:11   ` Krzysztof Kozlowski
@ 2022-06-06 20:39     ` Georgi Djakov
  -1 siblings, 0 replies; 52+ messages in thread
From: Georgi Djakov @ 2022-06-06 20:39 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Andy Gross, Bjorn Andersson, Rob Herring,
	Catalin Marinas, Will Deacon, linux-arm-msm, linux-pm,
	devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath

On 1.06.22 13:11, Krzysztof Kozlowski wrote:
> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
> Cache (memnoc).  Usage of this BWMON allows removing the fixed bandwidth
> votes from cpufreq (CPU nodes) and thus achieving high memory throughput
> even with lower CPU frequencies.
> 
> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
> 1. No noticeable impact when running with schedutil or performance
>     governors.
> 
> 2. When comparing to customized kernel with synced interconnects and
>     without bandwidth votes from CPU freq, the sysbench memory tests
>     show significant improvement with bwmon for blocksizes past the L3
>     cache.  The results for such superficial comparison:
> 
> sysbench memory test, results in MB/s (higher is better)
>   bs kB |  type |    V  | V+no bw votes | bwmon | benefit %
>       1 | W/seq | 14795 |          4816 |  4985 |      3.5%
>      64 | W/seq | 41987 |         10334 | 10433 |      1.0%
>    4096 | W/seq | 29768 |          8728 | 32007 |    266.7%
>   65536 | W/seq | 17711 |          4846 | 18399 |    279.6%
> 262144 | W/seq | 16112 |          4538 | 17429 |    284.1%
>      64 | R/seq | 61202 |         67092 | 66804 |     -0.4%
>    4096 | R/seq | 23871 |          5458 | 24307 |    345.4%
>   65536 | R/seq | 18554 |          4240 | 18685 |    340.7%
> 262144 | R/seq | 17524 |          4207 | 17774 |    322.4%
>      64 | W/rnd |  2663 |          1098 |  1119 |      1.9%
>   65536 | W/rnd |   600 |           316 |   610 |     92.7%
>      64 | R/rnd |  4915 |          4784 |  4594 |     -4.0%
>   65536 | R/rnd |   664 |           281 |   678 |    140.7%
> 
> Legend:
> bs kB: block size in KB (small block size means only L1-3 caches are
>        used)
> type: R - read, W - write, seq - sequential, rnd - random
> V: vanilla (next-20220422)
> V + no bw votes: vanilla without bandwidth votes from CPU freq
> bwmon: bwmon without bandwidth votes from CPU freq
> benefit %: difference between vanilla without bandwidth votes and bwmon
>             (higher is better)
> 

Ok, now I see! So bwmon shows similar performance compared with the current
cpufreq-based bandwidth scaling. And if you add bwmon on top of vanilla, are
the results close/same? Is the plan to remove the cpufreq-based bandwidth
scaling and switch to bwmon? It might improve the power consumption in some
scenarios.

Thanks,
Georgi

> Co-developed-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> ---
>   arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
>   1 file changed, 54 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> index 83e8b63f0910..adffb9c70566 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>   			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>   		};
>   
> +		pmu@1436400 {
> +			compatible = "qcom,sdm845-cpu-bwmon";
> +			reg = <0 0x01436400 0 0x600>;
> +
> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
> +
> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
> +			interconnect-names = "ddr", "l3c";
> +
> +			operating-points-v2 = <&cpu_bwmon_opp_table>;
> +
> +			cpu_bwmon_opp_table: opp-table {
> +				compatible = "operating-points-v2";
> +
> +				/*
> +				 * The interconnect paths bandwidths taken from
> +				 * cpu4_opp_table bandwidth.
> +				 * They also match different tables from
> +				 * msm-4.9 downstream kernel:
> +				 *  - the gladiator_noc-mem_noc from bandwidth
> +				 *    table of qcom,llccbw (property qcom,bw-tbl);
> +				 *    bus width: 4 bytes;
> +				 *  - the OSM L3 from bandwidth table of
> +				 *    qcom,cpu4-l3lat-mon (qcom,core-dev-table);
> +				 *    bus width: 16 bytes;
> +				 */
> +				opp-0 {
> +					opp-peak-kBps = <800000 4800000>;
> +				};
> +				opp-1 {
> +					opp-peak-kBps = <1804000 9216000>;
> +				};
> +				opp-2 {
> +					opp-peak-kBps = <2188000 11980800>;
> +				};
> +				opp-3 {
> +					opp-peak-kBps = <3072000 15052800>;
> +				};
> +				opp-4 {
> +					opp-peak-kBps = <4068000 19353600>;
> +				};
> +				opp-5 {
> +					opp-peak-kBps = <5412000 20889600>;
> +				};
> +				opp-6 {
> +					opp-peak-kBps = <6220000 22425600>;
> +				};
> +				opp-7 {
> +					opp-peak-kBps = <7216000 25497600>;
> +				};
> +			};
> +		};
> +
>   		pcie0: pci@1c00000 {
>   			compatible = "qcom,pcie-sdm845";
>   			reg = <0 0x01c00000 0 0x2000>,



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  2022-06-01 10:11   ` Krzysztof Kozlowski
@ 2022-06-06 21:11     ` Bjorn Andersson
  -1 siblings, 0 replies; 52+ messages in thread
From: Bjorn Andersson @ 2022-06-06 21:11 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Rob Herring

On Wed 01 Jun 03:11 PDT 2022, Krzysztof Kozlowski wrote:

> Add bindings for the Qualcomm Bandwidth Monitor device providing
> performance data on interconnects.  The bindings describe only BWMON
> version 4, e.g. the instance on SDM845 between CPU and Last Level Cache
> Controller.
> 
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> Reviewed-by: Rob Herring <robh@kernel.org>
> Acked-by: Georgi Djakov <djakov@kernel.org>
> ---
>  .../interconnect/qcom,sdm845-cpu-bwmon.yaml   | 97 +++++++++++++++++++
>  1 file changed, 97 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
> 
> diff --git a/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
> new file mode 100644
> index 000000000000..8c82e06ee432
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
> @@ -0,0 +1,97 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/interconnect/qcom,sdm845-cpu-bwmon.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Qualcomm Interconnect Bandwidth Monitor
> +
> +maintainers:
> +  - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> +
> +description:
> +  Bandwidth Monitor measures current throughput on buses between various NoC
> +  fabrics and provides information when it crosses configured thresholds.
> +
> +properties:
> +  compatible:
> +    enum:
> +      - qcom,sdm845-cpu-bwmon       # BWMON v4

It seems the thing that's called bwmon v4 is compatible with a number of
different platforms. Should we add a generic compatible to the binding
as well, to avoid having to update the implementation for each SoC?

(I.e. "qcom,sdm845-cpu-bwmon", "qcom,bwmon-v4")
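
For illustration, a node using such a fallback could then carry both
strings (a sketch only - the generic "qcom,bwmon-v4" compatible is the
suggestion above, not something defined by this series):

    pmu@1436400 {
        compatible = "qcom,sdm845-cpu-bwmon", "qcom,bwmon-v4";
        /* reg, interrupts, interconnects, OPP table as in the example */
    };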

Regards,
Bjorn

> +
> +  interconnects:
> +    maxItems: 2
> +
> +  interconnect-names:
> +    items:
> +      - const: ddr
> +      - const: l3c
> +
> +  interrupts:
> +    maxItems: 1
> +
> +  operating-points-v2: true
> +  opp-table: true
> +
> +  reg:
> +    # Currently described BWMON v4 and v5 use one register address space.
> +    # BWMON v2 uses two register spaces - not yet described.
> +    maxItems: 1
> +
> +required:
> +  - compatible
> +  - interconnects
> +  - interconnect-names
> +  - interrupts
> +  - operating-points-v2
> +  - opp-table
> +  - reg
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +    #include <dt-bindings/interconnect/qcom,osm-l3.h>
> +    #include <dt-bindings/interconnect/qcom,sdm845.h>
> +    #include <dt-bindings/interrupt-controller/arm-gic.h>
> +
> +    pmu@1436400 {
> +        compatible = "qcom,sdm845-cpu-bwmon";
> +        reg = <0x01436400 0x600>;
> +
> +        interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
> +
> +        interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
> +                        <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
> +        interconnect-names = "ddr", "l3c";
> +
> +        operating-points-v2 = <&cpu_bwmon_opp_table>;
> +
> +        cpu_bwmon_opp_table: opp-table {
> +            compatible = "operating-points-v2";
> +
> +            opp-0 {
> +                opp-peak-kBps = <800000 4800000>;
> +            };
> +            opp-1 {
> +                opp-peak-kBps = <1804000 9216000>;
> +            };
> +            opp-2 {
> +                opp-peak-kBps = <2188000 11980800>;
> +            };
> +            opp-3 {
> +                opp-peak-kBps = <3072000 15052800>;
> +            };
> +            opp-4 {
> +                opp-peak-kBps = <4068000 19353600>;
> +            };
> +            opp-5 {
> +                opp-peak-kBps = <5412000 20889600>;
> +            };
> +            opp-6 {
> +                opp-peak-kBps = <6220000 22425600>;
> +            };
> +            opp-7 {
> +                opp-peak-kBps = <7216000 25497600>;
> +            };
> +        };
> +    };
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-06 20:39     ` Georgi Djakov
@ 2022-06-07  6:48       ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-07  6:48 UTC (permalink / raw)
  To: Georgi Djakov, Andy Gross, Bjorn Andersson, Rob Herring,
	Catalin Marinas, Will Deacon, linux-arm-msm, linux-pm,
	devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath

On 06/06/2022 22:39, Georgi Djakov wrote:
> On 1.06.22 13:11, Krzysztof Kozlowski wrote:
>> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
>> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
>> Cache (memnoc).  Usage of this BWMON allows removing the fixed bandwidth
>> votes from cpufreq (CPU nodes) and thus achieving high memory throughput
>> even with lower CPU frequencies.
>>
>> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
>> 1. No noticeable impact when running with schedutil or performance
>>     governors.
>>
>> 2. When comparing to customized kernel with synced interconnects and
>>     without bandwidth votes from CPU freq, the sysbench memory tests
>>     show significant improvement with bwmon for blocksizes past the L3
>>     cache.  The results for such superficial comparison:
>>
>> sysbench memory test, results in MB/s (higher is better)
>>   bs kB |  type |    V  | V+no bw votes | bwmon | benefit %
>>       1 | W/seq | 14795 |          4816 |  4985 |      3.5%
>>      64 | W/seq | 41987 |         10334 | 10433 |      1.0%
>>    4096 | W/seq | 29768 |          8728 | 32007 |    266.7%
>>   65536 | W/seq | 17711 |          4846 | 18399 |    279.6%
>> 262144 | W/seq | 16112 |          4538 | 17429 |    284.1%
>>      64 | R/seq | 61202 |         67092 | 66804 |     -0.4%
>>    4096 | R/seq | 23871 |          5458 | 24307 |    345.4%
>>   65536 | R/seq | 18554 |          4240 | 18685 |    340.7%
>> 262144 | R/seq | 17524 |          4207 | 17774 |    322.4%
>>      64 | W/rnd |  2663 |          1098 |  1119 |      1.9%
>>   65536 | W/rnd |   600 |           316 |   610 |     92.7%
>>      64 | R/rnd |  4915 |          4784 |  4594 |     -4.0%
>>   65536 | R/rnd |   664 |           281 |   678 |    140.7%
>>
>> Legend:
>> bs kB: block size in KB (small block size means only L1-3 caches are
>>        used)
>> type: R - read, W - write, seq - sequential, rnd - random
>> V: vanilla (next-20220422)
>> V + no bw votes: vanilla without bandwidth votes from CPU freq
>> bwmon: bwmon without bandwidth votes from CPU freq
>> benefit %: difference between vanilla without bandwidth votes and bwmon
>>             (higher is better)
>>
> 
> Ok, now I see! So bwmon shows similar performance compared with the current
> cpufreq-based bandwidth scaling. And if you add bwmon on top of vanilla, are
> the results close/same? 

Vanilla + bwmon results in almost no difference.

> Is the plan to remove the cpufreq-based bandwidth
> scaling and switch to bwmon? It might improve the power consumption in some
> scenarios.

The next plan would be to implement the second bwmon, one between CPU
and caches. With both of them, the cpufreq bandwidth votes can be
removed (I think Android might be interested in this).


Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  2022-06-06 21:11     ` Bjorn Andersson
@ 2022-06-07  6:50       ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-07  6:50 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Rob Herring

On 06/06/2022 23:11, Bjorn Andersson wrote:
> On Wed 01 Jun 03:11 PDT 2022, Krzysztof Kozlowski wrote:
> 
>> Add bindings for the Qualcomm Bandwidth Monitor device providing
>> performance data on interconnects.  The bindings describe only BWMON
>> version 4, e.g. the instance on SDM845 between CPU and Last Level Cache
>> Controller.
>>
>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>> Reviewed-by: Rob Herring <robh@kernel.org>
>> Acked-by: Georgi Djakov <djakov@kernel.org>
>> ---
>>  .../interconnect/qcom,sdm845-cpu-bwmon.yaml   | 97 +++++++++++++++++++
>>  1 file changed, 97 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>
>> diff --git a/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>> new file mode 100644
>> index 000000000000..8c82e06ee432
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>> @@ -0,0 +1,97 @@
>> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
>> +%YAML 1.2
>> +---
>> +$id: http://devicetree.org/schemas/interconnect/qcom,sdm845-cpu-bwmon.yaml#
>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>> +
>> +title: Qualcomm Interconnect Bandwidth Monitor
>> +
>> +maintainers:
>> +  - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>> +
>> +description:
>> +  Bandwidth Monitor measures current throughput on buses between various NoC
>> +  fabrics and provides information when it crosses configured thresholds.
>> +
>> +properties:
>> +  compatible:
>> +    enum:
>> +      - qcom,sdm845-cpu-bwmon       # BWMON v4
> 
> It seems the thing that's called bwmon v4 is compatible with a number of
> different platforms. Should we add a generic compatible to the binding
> as well, to avoid having to update the implementation for each SoC?
> 
> (I.e. "qcom,sdm845-cpu-bwmon", "qcom,bwmon-v4")

I am hesitant. I could not find BWMON IP block versioning in the
Qualcomm docs. Only the downstream sources had it. Therefore I think it
is more appropriate to use this one as a fallback for other boards, e.g.:

"qcom,sdm660-cpu-bwmon", "qcom,sdm845-cpu-bwmon"
(even if the numbering is a bit odd - the newer SoC comes as the last compatible).
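
In a hypothetical SDM660 DT node that ordering would read as below (the
node name and address are reused from the SDM845 example purely for
illustration; they are not SDM660 values):

    pmu@1436400 {
        compatible = "qcom,sdm660-cpu-bwmon", "qcom,sdm845-cpu-bwmon";
        /* reg, interrupts, interconnects, OPP table as usual */
    };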

What's your preference?

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-01 10:11   ` Krzysztof Kozlowski
@ 2022-06-22 11:46     ` Rajendra Nayak
  -1 siblings, 0 replies; 52+ messages in thread
From: Rajendra Nayak @ 2022-06-22 11:46 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Andy Gross, Bjorn Andersson, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath


On 6/1/2022 3:41 PM, Krzysztof Kozlowski wrote:
> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
> Cache (memnoc).  Usage of this BWMON allows removing the fixed bandwidth
> votes from cpufreq (CPU nodes) and thus achieving high memory throughput
> even with lower CPU frequencies.
> 
> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
> 1. No noticeable impact when running with schedutil or performance
>     governors.
> 
> 2. When comparing to customized kernel with synced interconnects and
>     without bandwidth votes from CPU freq, the sysbench memory tests
>     show significant improvement with bwmon for blocksizes past the L3
>     cache.  The results for such superficial comparison:
> 
> sysbench memory test, results in MB/s (higher is better)
>   bs kB |  type |    V  | V+no bw votes | bwmon | benefit %
>       1 | W/seq | 14795 |          4816 |  4985 |      3.5%
>      64 | W/seq | 41987 |         10334 | 10433 |      1.0%
>    4096 | W/seq | 29768 |          8728 | 32007 |    266.7%
>   65536 | W/seq | 17711 |          4846 | 18399 |    279.6%
> 262144 | W/seq | 16112 |          4538 | 17429 |    284.1%
>      64 | R/seq | 61202 |         67092 | 66804 |     -0.4%
>    4096 | R/seq | 23871 |          5458 | 24307 |    345.4%
>   65536 | R/seq | 18554 |          4240 | 18685 |    340.7%
> 262144 | R/seq | 17524 |          4207 | 17774 |    322.4%
>      64 | W/rnd |  2663 |          1098 |  1119 |      1.9%
>   65536 | W/rnd |   600 |           316 |   610 |     92.7%
>      64 | R/rnd |  4915 |          4784 |  4594 |     -4.0%
>   65536 | R/rnd |   664 |           281 |   678 |    140.7%
> 
> Legend:
> bs kB: block size in KB (small block size means only L1-3 caches are
>        used)
> type: R - read, W - write, seq - sequential, rnd - random
> V: vanilla (next-20220422)
> V + no bw votes: vanilla without bandwidth votes from CPU freq
> bwmon: bwmon without bandwidth votes from CPU freq
> benefit %: difference between vanilla without bandwidth votes and bwmon
>             (higher is better)
> 
> Co-developed-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> ---
>   arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
>   1 file changed, 54 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> index 83e8b63f0910..adffb9c70566 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>   			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>   		};
>   
> +		pmu@1436400 {
> +			compatible = "qcom,sdm845-cpu-bwmon";
> +			reg = <0 0x01436400 0 0x600>;
> +
> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
> +
> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
> +			interconnect-names = "ddr", "l3c";

Is this the pmu/bwmon instance between the CPU and caches or the one between the caches and DDR?
Depending on which one it is, shouldn't we just be scaling one of the interconnect paths and not both?

> +
> +			operating-points-v2 = <&cpu_bwmon_opp_table>;
> +
> +			cpu_bwmon_opp_table: opp-table {
> +				compatible = "operating-points-v2";
> +
> +				/*
> +				 * The interconnect paths bandwidths taken from
> +				 * cpu4_opp_table bandwidth.
> +				 * They also match different tables from
> +				 * msm-4.9 downstream kernel:
> +				 *  - the gladiator_noc-mem_noc from bandwidth
> +				 *    table of qcom,llccbw (property qcom,bw-tbl);
> +				 *    bus width: 4 bytes;
> +				 *  - the OSM L3 from bandwidth table of
> +				 *    qcom,cpu4-l3lat-mon (qcom,core-dev-table);
> +				 *    bus width: 16 bytes;
> +				 */
> +				opp-0 {
> +					opp-peak-kBps = <800000 4800000>;
> +				};
> +				opp-1 {
> +					opp-peak-kBps = <1804000 9216000>;
> +				};
> +				opp-2 {
> +					opp-peak-kBps = <2188000 11980800>;
> +				};
> +				opp-3 {
> +					opp-peak-kBps = <3072000 15052800>;
> +				};
> +				opp-4 {
> +					opp-peak-kBps = <4068000 19353600>;
> +				};
> +				opp-5 {
> +					opp-peak-kBps = <5412000 20889600>;
> +				};
> +				opp-6 {
> +					opp-peak-kBps = <6220000 22425600>;
> +				};
> +				opp-7 {
> +					opp-peak-kBps = <7216000 25497600>;
> +				};
> +			};
> +		};
> +
>   		pcie0: pci@1c00000 {
>   			compatible = "qcom,pcie-sdm845";
>   			reg = <0 0x01c00000 0 0x2000>,

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  2022-06-07  6:50       ` Krzysztof Kozlowski
@ 2022-06-22 11:58         ` Rajendra Nayak
  -1 siblings, 0 replies; 52+ messages in thread
From: Rajendra Nayak @ 2022-06-22 11:58 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Rob Herring



On 6/7/2022 12:20 PM, Krzysztof Kozlowski wrote:
> On 06/06/2022 23:11, Bjorn Andersson wrote:
>> On Wed 01 Jun 03:11 PDT 2022, Krzysztof Kozlowski wrote:
>>
>>> Add bindings for the Qualcomm Bandwidth Monitor device providing
>>> performance data on interconnects.  The bindings describe only BWMON
>>> version 4, e.g. the instance on SDM845 between CPU and Last Level Cache
>>> Controller.
>>>
>>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>> Reviewed-by: Rob Herring <robh@kernel.org>
>>> Acked-by: Georgi Djakov <djakov@kernel.org>
>>> ---
>>>   .../interconnect/qcom,sdm845-cpu-bwmon.yaml   | 97 +++++++++++++++++++
>>>   1 file changed, 97 insertions(+)
>>>   create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>>
>>> diff --git a/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>> new file mode 100644
>>> index 000000000000..8c82e06ee432
>>> --- /dev/null
>>> +++ b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>> @@ -0,0 +1,97 @@
>>> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
>>> +%YAML 1.2
>>> +---
>>> +$id: http://devicetree.org/schemas/interconnect/qcom,sdm845-cpu-bwmon.yaml#
>>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>>> +
>>> +title: Qualcomm Interconnect Bandwidth Monitor
>>> +
>>> +maintainers:
>>> +  - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>> +
>>> +description:
>>> +  Bandwidth Monitor measures current throughput on buses between various NoC
>>> +  fabrics and provides information when it crosses configured thresholds.
>>> +
>>> +properties:
>>> +  compatible:
>>> +    enum:
>>> +      - qcom,sdm845-cpu-bwmon       # BWMON v4
>>
>> It seems the thing that's called bwmon v4 is compatible with a number of
>> different platforms, should we add a generic compatible to the binding
>> as well, to avoid having to update the implementation for each SoC?
>>
>> (I.e. "qcom,sdm845-cpu-bwmon", "qcom,bwmon-v4")

It seems pretty useful to have the "qcom,bwmon-v4" and "qcom,bwmon-v5"
compatibles. I tried these patches on an sc7280 device, which has a bwmon4
between the CPU and caches (and also a bwmon5 between the caches and DDR),
and the driver works with zero changes.

> 
> I am hesitant. I could not find BWMON IP block versioning in the
> Qualcomm docs. Only the downstream sources had it. Therefore I think it
> is more applicable to use this one as fallback for other boards, e.g.:
> 
> "qcom,sdm660-cpu-bwmon", "qcom,sdm845-cpu-bwmon"
> (even if the number is a bit odd - newer comes as last compatible).
> 
> What's your preference?
> 
> Best regards,
> Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  2022-06-22 11:58         ` Rajendra Nayak
@ 2022-06-22 12:20           ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-22 12:20 UTC (permalink / raw)
  To: Rajendra Nayak, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Rob Herring

On 22/06/2022 13:58, Rajendra Nayak wrote:
> 
> 
> On 6/7/2022 12:20 PM, Krzysztof Kozlowski wrote:
>> On 06/06/2022 23:11, Bjorn Andersson wrote:
>>> On Wed 01 Jun 03:11 PDT 2022, Krzysztof Kozlowski wrote:
>>>
>>>> Add bindings for the Qualcomm Bandwidth Monitor device providing
>>>> performance data on interconnects.  The bindings describe only BWMON
>>>> version 4, e.g. the instance on SDM845 between CPU and Last Level Cache
>>>> Controller.
>>>>
>>>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>> Reviewed-by: Rob Herring <robh@kernel.org>
>>>> Acked-by: Georgi Djakov <djakov@kernel.org>
>>>> ---
>>>>   .../interconnect/qcom,sdm845-cpu-bwmon.yaml   | 97 +++++++++++++++++++
>>>>   1 file changed, 97 insertions(+)
>>>>   create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>>>
>>>> diff --git a/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>>> new file mode 100644
>>>> index 000000000000..8c82e06ee432
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>>> @@ -0,0 +1,97 @@
>>>> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
>>>> +%YAML 1.2
>>>> +---
>>>> +$id: http://devicetree.org/schemas/interconnect/qcom,sdm845-cpu-bwmon.yaml#
>>>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>>>> +
>>>> +title: Qualcomm Interconnect Bandwidth Monitor
>>>> +
>>>> +maintainers:
>>>> +  - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>> +
>>>> +description:
>>>> +  Bandwidth Monitor measures current throughput on buses between various NoC
>>>> +  fabrics and provides information when it crosses configured thresholds.
>>>> +
>>>> +properties:
>>>> +  compatible:
>>>> +    enum:
>>>> +      - qcom,sdm845-cpu-bwmon       # BWMON v4
>>>
>>> It seems the thing that's called bwmon v4 is compatible with a number of
>>> different platforms, should we add a generic compatible to the binding
>>> as well, to avoid having to update the implementation for each SoC?
>>>
>>> (I.e. "qcom,sdm845-cpu-bwmon", "qcom,bwmon-v4")
> 
> it seems pretty useful to have the "qcom,bwmon-v4" and "qcom,bwmon-v5"
> compatibles, I tried these patches on a sc7280 device which has a bwmon4
> between the cpu and caches (and also has a bwmon5 between the caches and DDR)
> and the driver works with zero changes.

The trouble with naming it v4 is that such versioning does not exist in the
documentation. At least I failed to find it. Nor is there a clear mapping
between SoC and block version.

The only indication of BWMON versioning comes from the downstream sources,
which I find insufficient to justify using versions for the blocks.

Therefore, as per the DT recommendation (which I am enforcing on others), I am
not planning to put bwmon-v4 there.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread
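
To make the SoC-specific fallback suggested above concrete, a node on another
SoC could then reuse the SDM845 entry along these lines (a minimal sketch; the
sdm660 unit address and the omitted resources are placeholders, not taken from
this series):

	pmu@... {
		compatible = "qcom,sdm660-cpu-bwmon", "qcom,sdm845-cpu-bwmon";
		/* reg, interrupts, interconnects and opp-table as appropriate
		 * for that SoC; the driver would then only need to match the
		 * SDM845 fallback entry.
		 */
	};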

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-22 11:46     ` Rajendra Nayak
@ 2022-06-22 13:52       ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-22 13:52 UTC (permalink / raw)
  To: Rajendra Nayak, Andy Gross, Bjorn Andersson, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath

On 22/06/2022 13:46, Rajendra Nayak wrote:
> 
> On 6/1/2022 3:41 PM, Krzysztof Kozlowski wrote:
>> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
>> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
>> Cache (memnoc).  Usage of this BWMON allows to remove fixed bandwidth
>> votes from cpufreq (CPU nodes) thus achieve high memory throughput even
>> with lower CPU frequencies.
>>
>> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
>> 1. No noticeable impact when running with schedutil or performance
>>     governors.
>>
>> 2. When comparing to customized kernel with synced interconnects and
>>     without bandwidth votes from CPU freq, the sysbench memory tests
>>     show significant improvement with bwmon for blocksizes past the L3
>>     cache.  The results for such superficial comparison:
>>
>> sysbench memory test, results in MB/s (higher is better)
>>   bs kB |  type |    V  | V+no bw votes | bwmon | benefit %
>>       1 | W/seq | 14795 |          4816 |  4985 |      3.5%
>>      64 | W/seq | 41987 |         10334 | 10433 |      1.0%
>>    4096 | W/seq | 29768 |          8728 | 32007 |    266.7%
>>   65536 | W/seq | 17711 |          4846 | 18399 |    279.6%
>> 262144 | W/seq | 16112 |          4538 | 17429 |    284.1%
>>      64 | R/seq | 61202 |         67092 | 66804 |     -0.4%
>>    4096 | R/seq | 23871 |          5458 | 24307 |    345.4%
>>   65536 | R/seq | 18554 |          4240 | 18685 |    340.7%
>> 262144 | R/seq | 17524 |          4207 | 17774 |    322.4%
>>      64 | W/rnd |  2663 |          1098 |  1119 |      1.9%
>>   65536 | W/rnd |   600 |           316 |   610 |     92.7%
>>      64 | R/rnd |  4915 |          4784 |  4594 |     -4.0%
>>   65536 | R/rnd |   664 |           281 |   678 |    140.7%
>>
>> Legend:
>> bs kB: block size in KB (small block size means only L1-3 caches are
>>        used
>> type: R - read, W - write, seq - sequential, rnd - random
>> V: vanilla (next-20220422)
>> V + no bw votes: vanilla without bandwidth votes from CPU freq
>> bwmon: bwmon without bandwidth votes from CPU freq
>> benefit %: difference between vanilla without bandwidth votes and bwmon
>>             (higher is better)
>>
>> Co-developed-by: Thara Gopinath <thara.gopinath@linaro.org>
>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>> ---
>>   arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
>>   1 file changed, 54 insertions(+)
>>
>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> index 83e8b63f0910..adffb9c70566 100644
>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>   			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>   		};
>>   
>> +		pmu@1436400 {
>> +			compatible = "qcom,sdm845-cpu-bwmon";
>> +			reg = <0 0x01436400 0 0x600>;
>> +
>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>> +
>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>> +			interconnect-names = "ddr", "l3c";
> 
> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?

To my understanding this is the one between CPU and caches.

> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?

The interconnects are the same as the ones used for the CPU nodes; therefore, if
we want to scale both when scaling the CPU, then we also want to scale both
when seeing traffic between the CPU and the cache.

Maybe the assumption here is not correct, which would mean the two
interconnects in the CPU nodes are also not proper?


Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-22 13:52       ` Krzysztof Kozlowski
@ 2022-06-23  6:48         ` Rajendra Nayak
  -1 siblings, 0 replies; 52+ messages in thread
From: Rajendra Nayak @ 2022-06-23  6:48 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Andy Gross, Bjorn Andersson, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath


On 6/22/2022 7:22 PM, Krzysztof Kozlowski wrote:
> On 22/06/2022 13:46, Rajendra Nayak wrote:
>>
>> On 6/1/2022 3:41 PM, Krzysztof Kozlowski wrote:
>>> Add device node for CPU-memory BWMON device (bandwidth monitoring) on
>>> SDM845 measuring bandwidth between CPU (gladiator_noc) and Last Level
>>> Cache (memnoc).  Usage of this BWMON allows to remove fixed bandwidth
>>> votes from cpufreq (CPU nodes) thus achieve high memory throughput even
>>> with lower CPU frequencies.
>>>
>>> Performance impact (SDM845-MTP RB3 board, linux next-20220422):
>>> 1. No noticeable impact when running with schedutil or performance
>>>      governors.
>>>
>>> 2. When comparing to customized kernel with synced interconnects and
>>>      without bandwidth votes from CPU freq, the sysbench memory tests
>>>      show significant improvement with bwmon for blocksizes past the L3
>>>      cache.  The results for such superficial comparison:
>>>
>>> sysbench memory test, results in MB/s (higher is better)
>>>    bs kB |  type |    V  | V+no bw votes | bwmon | benefit %
>>>        1 | W/seq | 14795 |          4816 |  4985 |      3.5%
>>>       64 | W/seq | 41987 |         10334 | 10433 |      1.0%
>>>     4096 | W/seq | 29768 |          8728 | 32007 |    266.7%
>>>    65536 | W/seq | 17711 |          4846 | 18399 |    279.6%
>>> 262144 | W/seq | 16112 |          4538 | 17429 |    284.1%
>>>       64 | R/seq | 61202 |         67092 | 66804 |     -0.4%
>>>     4096 | R/seq | 23871 |          5458 | 24307 |    345.4%
>>>    65536 | R/seq | 18554 |          4240 | 18685 |    340.7%
>>> 262144 | R/seq | 17524 |          4207 | 17774 |    322.4%
>>>       64 | W/rnd |  2663 |          1098 |  1119 |      1.9%
>>>    65536 | W/rnd |   600 |           316 |   610 |     92.7%
>>>       64 | R/rnd |  4915 |          4784 |  4594 |     -4.0%
>>>    65536 | R/rnd |   664 |           281 |   678 |    140.7%
>>>
>>> Legend:
>>> bs kB: block size in KB (small block size means only L1-3 caches are
>>>         used
>>> type: R - read, W - write, seq - sequential, rnd - random
>>> V: vanilla (next-20220422)
>>> V + no bw votes: vanilla without bandwidth votes from CPU freq
>>> bwmon: bwmon without bandwidth votes from CPU freq
>>> benefit %: difference between vanilla without bandwidth votes and bwmon
>>>              (higher is better)
>>>
>>> Co-developed-by: Thara Gopinath <thara.gopinath@linaro.org>
>>> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>> ---
>>>    arch/arm64/boot/dts/qcom/sdm845.dtsi | 54 ++++++++++++++++++++++++++++
>>>    1 file changed, 54 insertions(+)
>>>
>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>> index 83e8b63f0910..adffb9c70566 100644
>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>    			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>    		};
>>>    
>>> +		pmu@1436400 {
>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>> +			reg = <0 0x01436400 0 0x600>;
>>> +
>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>> +
>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>> +			interconnect-names = "ddr", "l3c";
>>
>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
> 
> To my understanding this is the one between CPU and caches.

Ok, but then because the OPP table lists the DDR bw first and the cache bw second, isn't the driver
ending up comparing the bw values reported by the PMU against the DDR bw instead of the cache bw?
At least with my testing on sc7280 I found this to mess things up, and I was always ending up at
higher OPPs even while the system was completely idle. Comparing the values against the cache bw
fixed it. (sc7280 also has a bwmon4 instance between the CPU and caches and a bwmon5 between the cache
and DDR.)

> 
>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
> 
> The interconnects are the same as ones used for CPU nodes, therefore if
> we want to scale both when scaling CPU, then we also want to scale both
> when seeing traffic between CPU and cache.

Well, they were both associated with the CPU node because, with no other input to decide on _when_
to scale the caches and DDR, we just put in a mapping table which simply mapped a CPU freq to an L3 _and_
DDR freq. So with just one input (CPU freq) we decided what both the L3 freq and the DDR freq should be.

Now with 2 PMUs, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
counters and the DDR based on the DDR PMU counters, no?

Since you said you have plans to add support for the other PMU as well (bwmon5 between the cache and DDR),
how else would you have the OPP table associated with that PMU instance? Would you again have both the
L3 and DDR scale based on the inputs from that bwmon too?

> 
> Maybe the assumption here is not correct, so basically the two
> interconnects in CPU nodes are also not proper?
> 
> 
> Best regards,
> Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-23  6:48         ` Rajendra Nayak
@ 2022-06-23 12:58           ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-23 12:58 UTC (permalink / raw)
  To: Rajendra Nayak, Andy Gross, Bjorn Andersson, Georgi Djakov,
	Rob Herring, Catalin Marinas, Will Deacon, linux-arm-msm,
	linux-pm, devicetree, linux-kernel, linux-arm-kernel
  Cc: Thara Gopinath

On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>> index 83e8b63f0910..adffb9c70566 100644
>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>    			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>    		};
>>>>    
>>>> +		pmu@1436400 {
>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>> +			reg = <0 0x01436400 0 0x600>;
>>>> +
>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>> +
>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>> +			interconnect-names = "ddr", "l3c";
>>>
>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>
>> To my understanding this is the one between CPU and caches.
> 
> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?

I double-checked now and you're right.

> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
> and DDR)

In my case it exposes a different issue - underperformance. Somehow the
bwmon does not report bandwidth high enough to vote for high bandwidth.

After removing the DDR interconnect and its bandwidth OPP values, these are the results for:
sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
--memory-block-size=4M run

1. Vanilla: 29768 MB/s
2. Vanilla without CPU votes: 8728 MB/s
3. Previous bwmon (voting too high): 32007 MB/s
4. Fixed bwmon: 24911 MB/s

Bwmon does not vote for the maximum L3 speed:
bwmon reports 9408 MB/s (thresholds set: <9216000 15052801>)
osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps

Maybe that's just a problem with the missing governor, which would round the
bandwidth vote up or anticipate higher needs.

>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>
>> The interconnects are the same as ones used for CPU nodes, therefore if
>> we want to scale both when scaling CPU, then we also want to scale both
>> when seeing traffic between CPU and cache.
> 
> Well, they were both associated with the CPU node because with no other input to decide on _when_
> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
> 
> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
> counters and DDR based on the DDR PMU counters, no?
> 
> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
> how else would you have the OPP table associated with that pmu instance? Would you again have both the
> L3 and DDR scale based on the inputs from that bwmon too?

Good point, thanks for sharing. I think you're right. I'll keep only the
l3c interconnect path.


Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread
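
As a rough illustration of keeping only the l3c path discussed above, the
sdm845.dtsi node could end up looking along these lines (sketch only, not the
final revision of the patch; the single-value opp-peak-kBps entries simply
keep the L3 cell of the original pairs):

	pmu@1436400 {
		compatible = "qcom,sdm845-cpu-bwmon";
		reg = <0 0x01436400 0 0x600>;

		interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;

		interconnects = <&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
		interconnect-names = "l3c";

		operating-points-v2 = <&cpu_bwmon_opp_table>;

		cpu_bwmon_opp_table: opp-table {
			compatible = "operating-points-v2";

			opp-0 {
				opp-peak-kBps = <4800000>;
			};
			/* ... opp-1 through opp-6 likewise keep only the
			 * second cell of the original pairs ...
			 */
			opp-7 {
				opp-peak-kBps = <25497600>;
			};
		};
	};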

* Re: [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  2022-06-22 11:58         ` Rajendra Nayak
@ 2022-06-26  3:19           ` Bjorn Andersson
  -1 siblings, 0 replies; 52+ messages in thread
From: Bjorn Andersson @ 2022-06-26  3:19 UTC (permalink / raw)
  To: Rajendra Nayak
  Cc: Krzysztof Kozlowski, Andy Gross, Georgi Djakov, Rob Herring,
	Catalin Marinas, Will Deacon, linux-arm-msm, linux-pm,
	devicetree, linux-kernel, linux-arm-kernel, Rob Herring

On Wed 22 Jun 06:58 CDT 2022, Rajendra Nayak wrote:

> 
> 
> On 6/7/2022 12:20 PM, Krzysztof Kozlowski wrote:
> > On 06/06/2022 23:11, Bjorn Andersson wrote:
> > > On Wed 01 Jun 03:11 PDT 2022, Krzysztof Kozlowski wrote:
> > > 
> > > > Add bindings for the Qualcomm Bandwidth Monitor device providing
> > > > performance data on interconnects.  The bindings describe only BWMON
> > > > version 4, e.g. the instance on SDM845 between CPU and Last Level Cache
> > > > Controller.
> > > > 
> > > > Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> > > > Reviewed-by: Rob Herring <robh@kernel.org>
> > > > Acked-by: Georgi Djakov <djakov@kernel.org>
> > > > ---
> > > >   .../interconnect/qcom,sdm845-cpu-bwmon.yaml   | 97 +++++++++++++++++++
> > > >   1 file changed, 97 insertions(+)
> > > >   create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
> > > > 
> > > > diff --git a/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
> > > > new file mode 100644
> > > > index 000000000000..8c82e06ee432
> > > > --- /dev/null
> > > > +++ b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
> > > > @@ -0,0 +1,97 @@
> > > > +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> > > > +%YAML 1.2
> > > > +---
> > > > +$id: http://devicetree.org/schemas/interconnect/qcom,sdm845-cpu-bwmon.yaml#
> > > > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > > > +
> > > > +title: Qualcomm Interconnect Bandwidth Monitor
> > > > +
> > > > +maintainers:
> > > > +  - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
> > > > +
> > > > +description:
> > > > +  Bandwidth Monitor measures current throughput on buses between various NoC
> > > > +  fabrics and provides information when it crosses configured thresholds.
> > > > +
> > > > +properties:
> > > > +  compatible:
> > > > +    enum:
> > > > +      - qcom,sdm845-cpu-bwmon       # BWMON v4
> > > 
> > > It seems the thing that's called bwmon v4 is compatible with a number of
> > > different platforms, should we add a generic compatible to the binding
> > > as well, to avoid having to update the implementation for each SoC?
> > > 
> > > (I.e. "qcom,sdm845-cpu-bwmon", "qcom,bwmon-v4")
> 
> it seems pretty useful to have the "qcom,bwmon-v4" and "qcom,bwmon-v5"
> compatibles, I tried these patches on a sc7280 device which has a bwmon4
> between the cpu and caches (and also has a bwmon5 between the caches and DDR)
> and the driver works with zero changes.
> 

But do the '4' and '5' have a relation to the hardware? Or are they just the
4th and 5th register layouts supported by the downstream driver?

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-23 12:58           ` Krzysztof Kozlowski
@ 2022-06-26  3:28             ` Bjorn Andersson
  -1 siblings, 0 replies; 52+ messages in thread
From: Bjorn Andersson @ 2022-06-26  3:28 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: Rajendra Nayak, Andy Gross, Georgi Djakov, Rob Herring,
	Catalin Marinas, Will Deacon, linux-arm-msm, linux-pm,
	devicetree, linux-kernel, linux-arm-kernel, Thara Gopinath

On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:

> On 23/06/2022 08:48, Rajendra Nayak wrote:
> >>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >>>> index 83e8b63f0910..adffb9c70566 100644
> >>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> >>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
> >>>>    			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
> >>>>    		};
> >>>>    
> >>>> +		pmu@1436400 {
> >>>> +			compatible = "qcom,sdm845-cpu-bwmon";
> >>>> +			reg = <0 0x01436400 0 0x600>;
> >>>> +
> >>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
> >>>> +
> >>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
> >>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
> >>>> +			interconnect-names = "ddr", "l3c";
> >>>
> >>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
> >>
> >> To my understanding this is the one between CPU and caches.
> > 
> > Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
> > ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
> 
> I double checked now and you're right.
> 
> > Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
> > higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
> > fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
> > and DDR)
> 
> In my case it exposes different issue - under performance. Somehow the
> bwmon does not report bandwidth high enough to vote for high bandwidth.
> 
> After removing the DDR interconnect and bandwidth OPP values I have for:
> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
> --memory-block-size=4M run
> 
> 1. Vanilla: 29768 MB/s
> 2. Vanilla without CPU votes: 8728 MB/s
> 3. Previous bwmon (voting too high): 32007 MB/s
> 4. Fixed bwmon 24911 MB/s
> Bwmon does not vote for maximum L3 speed:
> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
> )
> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
> 
> Maybe that's just problem with missing governor which would vote for
> bandwidth rounding up or anticipating higher needs.
> 
> >>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
> >>
> >> The interconnects are the same as ones used for CPU nodes, therefore if
> >> we want to scale both when scaling CPU, then we also want to scale both
> >> when seeing traffic between CPU and cache.
> > 
> > Well, they were both associated with the CPU node because with no other input to decide on _when_
> > to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
> > DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
> > 
> > Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
> > counters and DDR based on the DDR PMU counters, no?
> > 
> > Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
> > how else would you have the OPP table associated with that pmu instance? Would you again have both the
> > L3 and DDR scale based on the inputs from that bwmon too?
> 
> Good point, thanks for sharing. I think you're right. I'll keep only the
> l3c interconnect path.
> 

If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
subsystem. As such, traffic hitting this cache will not show up in either
bwmon instance.

The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
affects the DDR frequency. So the traffic measured by the cpu-bwmon
would be the CPU subsystem's traffic that missed the L1/L2/L3 caches and
hits the memory bus towards DDR.


If this is the case, it seems to make sense to keep the L3 scaling in the
opp-tables for the CPU and make bwmon only scale the DDR path. What do
you think?

Regards,
Bjorn

^ permalink raw reply	[flat|nested] 52+ messages in thread
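
Under Bjorn's suggestion above (leaving the L3 scaling to the CPU opp-tables),
the bwmon node would instead reference only the DDR path, e.g.:

	interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>;

with the opp-peak-kBps entries keeping just the first (ddr) cell of each
original pair, from <800000> up to <7216000>. Again, this is only a sketch of
the proposal, not the final form of the patch.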

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-26  3:28             ` Bjorn Andersson
@ 2022-06-27 12:39               ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-27 12:39 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Rajendra Nayak, Andy Gross, Georgi Djakov, Rob Herring,
	Catalin Marinas, Will Deacon, linux-arm-msm, linux-pm,
	devicetree, linux-kernel, linux-arm-kernel, Thara Gopinath

On 26/06/2022 05:28, Bjorn Andersson wrote:
> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
> 
>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>    			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>    		};
>>>>>>    
>>>>>> +		pmu@1436400 {
>>>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>>>> +			reg = <0 0x01436400 0 0x600>;
>>>>>> +
>>>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>> +
>>>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>> +			interconnect-names = "ddr", "l3c";
>>>>>
>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>
>>>> To my understanding this is the one between CPU and caches.
>>>
>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>
>> I double checked now and you're right.
>>
>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>> and DDR)
>>
>> In my case it exposes different issue - under performance. Somehow the
>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>
>> After removing the DDR interconnect and bandwidth OPP values I have for:
>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>> --memory-block-size=4M run
>>
>> 1. Vanilla: 29768 MB/s
>> 2. Vanilla without CPU votes: 8728 MB/s
>> 3. Previous bwmon (voting too high): 32007 MB/s
>> 4. Fixed bwmon 24911 MB/s
>> Bwmon does not vote for maximum L3 speed:
>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>> )
>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>
>> Maybe that's just problem with missing governor which would vote for
>> bandwidth rounding up or anticipating higher needs.
>>
>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>
>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>> when seeing traffic between CPU and cache.
>>>
>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>
>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>> counters and DDR based on the DDR PMU counters, no?
>>>
>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>> L3 and DDR scale based on the inputs from that bwmon too?
>>
>> Good point, thanks for sharing. I think you're right. I'll keep only the
>> l3c interconnect path.
>>
> 
> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
> subsystem. As such traffic hitting this cache will not show up in either
> bwmon instance.
> 
> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
> affects the DDR frequency. So the traffic measured by the cpu-bwmon
> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
> hits the memory bus towards DDR.
> 
> 
> If this is the case it seems to make sense to keep the L3 scaling in the
> opp-tables for the CPU and make bwmon only scale the DDR path. What do
> you think?

The data throughput reported by this bwmon instance is beyond the DDR
OPP table bandwidth, e.g. 16-22 GB/s, so it seems it still measures
traffic within the cache controller, not on the memory bus.

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-27 12:39               ` Krzysztof Kozlowski
@ 2022-06-28 10:36                 ` Rajendra Nayak
  -1 siblings, 0 replies; 52+ messages in thread
From: Rajendra Nayak @ 2022-06-28 10:36 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Thara Gopinath


On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
> On 26/06/2022 05:28, Bjorn Andersson wrote:
>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>
>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>     			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>     		};
>>>>>>>     
>>>>>>> +		pmu@1436400 {
>>>>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>> +			reg = <0 0x01436400 0 0x600>;
>>>>>>> +
>>>>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>> +
>>>>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>> +			interconnect-names = "ddr", "l3c";
>>>>>>
>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>
>>>>> To my understanding this is the one between CPU and caches.
>>>>
>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>
>>> I double checked now and you're right.
>>>
>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>> and DDR)
>>>
>>> In my case it exposes different issue - under performance. Somehow the
>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>
>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>> --memory-block-size=4M run
>>>
>>> 1. Vanilla: 29768 MB/s
>>> 2. Vanilla without CPU votes: 8728 MB/s
>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>> 4. Fixed bwmon 24911 MB/s
>>> Bwmon does not vote for maximum L3 speed:
>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>> )
>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>
>>> Maybe that's just problem with missing governor which would vote for
>>> bandwidth rounding up or anticipating higher needs.
>>>
>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>
>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>> when seeing traffic between CPU and cache.
>>>>
>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>
>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>> counters and DDR based on the DDR PMU counters, no?
>>>>
>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>
>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>> l3c interconnect path.
>>>
>>
>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>> subsystem. As such traffic hitting this cache will not show up in either
>> bwmon instance.
>>
>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>> hits the memory bus towards DDR.

That seems right. Looking some more into the downstream code and register definitions,
I see the 2 bwmon instances actually lie on the path outside the CPU SS towards DDR:
the first one (bwmon4) sits between the CPUSS and the LLCC (system cache), and the
second one (bwmon5) between the LLCC and DDR. So we should use the counters from
bwmon4 to scale the CPU-LLCC path (and not L3); on sc7280 that would mean splitting
<&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
<&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs),
and similarly for sdm845.

L3 should perhaps still be voted for based on the CPU frequency, as is done today.
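
To make this concrete, a rough sketch of how the two nodes could describe their
paths (only the interconnect endpoints are taken from the paths above; the
compatibles, unit addresses and interrupt numbers below are placeholders for
illustration, not real sc7280 values):

	/* bwmon4 instance: CPUSS -> LLCC (placeholder node) */
	pmu@a {
		compatible = "qcom,sc7280-cpu-bwmon";		/* hypothetical name */
		reg = <0 0xa 0 0x600>;				/* placeholder address */
		interrupts = <GIC_SPI 0 IRQ_TYPE_LEVEL_HIGH>;	/* placeholder */
		interconnects = <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3>;
		interconnect-names = "llcc";
	};

	/* bwmon5 instance: LLCC -> DDR (placeholder node) */
	pmu@b {
		compatible = "qcom,sc7280-llcc-bwmon";		/* hypothetical name */
		reg = <0 0xb 0 0x600>;				/* placeholder address */
		interrupts = <GIC_SPI 1 IRQ_TYPE_LEVEL_HIGH>;	/* placeholder */
		interconnects = <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3>;
		interconnect-names = "ddr";
	};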

>> If this is the case it seems to make sense to keep the L3 scaling in the
>> opp-tables for the CPU and make bwmon only scale the DDR path. What do
>> you think?
> 
> The reported data throughput by this bwmon instance is beyond the DDR
> OPP table bandwidth, e.g.: 16-22 GB/s, so it seems it measures still
> within cache controller, not the memory bus.
> 
> Best regards,
> Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device
  2022-06-26  3:19           ` Bjorn Andersson
@ 2022-06-28 10:43             ` Rajendra Nayak
  -1 siblings, 0 replies; 52+ messages in thread
From: Rajendra Nayak @ 2022-06-28 10:43 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: Krzysztof Kozlowski, Andy Gross, Georgi Djakov, Rob Herring,
	Catalin Marinas, Will Deacon, linux-arm-msm, linux-pm,
	devicetree, linux-kernel, linux-arm-kernel, Rob Herring



On 6/26/2022 8:49 AM, Bjorn Andersson wrote:
> On Wed 22 Jun 06:58 CDT 2022, Rajendra Nayak wrote:
> 
>>
>>
>> On 6/7/2022 12:20 PM, Krzysztof Kozlowski wrote:
>>> On 06/06/2022 23:11, Bjorn Andersson wrote:
>>>> On Wed 01 Jun 03:11 PDT 2022, Krzysztof Kozlowski wrote:
>>>>
>>>>> Add bindings for the Qualcomm Bandwidth Monitor device providing
>>>>> performance data on interconnects.  The bindings describe only BWMON
>>>>> version 4, e.g. the instance on SDM845 between CPU and Last Level Cache
>>>>> Controller.
>>>>>
>>>>> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>>> Reviewed-by: Rob Herring <robh@kernel.org>
>>>>> Acked-by: Georgi Djakov <djakov@kernel.org>
>>>>> ---
>>>>>    .../interconnect/qcom,sdm845-cpu-bwmon.yaml   | 97 +++++++++++++++++++
>>>>>    1 file changed, 97 insertions(+)
>>>>>    create mode 100644 Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>>>>
>>>>> diff --git a/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>>>> new file mode 100644
>>>>> index 000000000000..8c82e06ee432
>>>>> --- /dev/null
>>>>> +++ b/Documentation/devicetree/bindings/interconnect/qcom,sdm845-cpu-bwmon.yaml
>>>>> @@ -0,0 +1,97 @@
>>>>> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
>>>>> +%YAML 1.2
>>>>> +---
>>>>> +$id: http://devicetree.org/schemas/interconnect/qcom,sdm845-cpu-bwmon.yaml#
>>>>> +$schema: http://devicetree.org/meta-schemas/core.yaml#
>>>>> +
>>>>> +title: Qualcomm Interconnect Bandwidth Monitor
>>>>> +
>>>>> +maintainers:
>>>>> +  - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
>>>>> +
>>>>> +description:
>>>>> +  Bandwidth Monitor measures current throughput on buses between various NoC
>>>>> +  fabrics and provides information when it crosses configured thresholds.
>>>>> +
>>>>> +properties:
>>>>> +  compatible:
>>>>> +    enum:
>>>>> +      - qcom,sdm845-cpu-bwmon       # BWMON v4
>>>>
>>>> It seems the thing that's called bwmon v4 is compatible with a number of
>>>> different platforms, should we add a generic compatible to the binding
>>>> as well, to avoid having to update the implementation for each SoC?
>>>>
>>>> (I.e. "qcom,sdm845-cpu-bwmon", "qcom,bwmon-v4")
>>
>> it seems pretty useful to have the "qcom,bwmon-v4" and "qcom,bwmon-v5"
>> compatibles, I tried these patches on a sc7280 device which has a bwmon4
>> between the cpu and caches (and also has a bwmon5 between the caches and DDR)
>> and the driver works with zero changes.
>>
> 
> But does the '4' and '5' has a relation to the hardware? Or is just the
> 4th and 5th register layout supported by the downstream driver?

Right, it was just based on the downstream driver register layouts; I could not
find these numbers in the HW specs anywhere. That said, I do see 2 instances of
these, one of them called the LAGG bwmon, which is the one between the LLCC and DDR
and is documented as part of the LLCC specs. I'll try and dig some more into the
documentation to see how we could define compatibles to match HW revisions.
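
For example, with such a generic fallback in the binding, an sc7280 node could
then just use (the sc7280-specific string here is hypothetical; only the fallback
follows the suggestion above):

	compatible = "qcom,sc7280-cpu-bwmon", "qcom,bwmon-v4";

and the driver would only need a match entry for the generic "qcom,bwmon-v4".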

> 
> Regards,
> Bjorn

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-28 10:36                 ` Rajendra Nayak
@ 2022-06-28 10:50                   ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-28 10:50 UTC (permalink / raw)
  To: Rajendra Nayak, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Thara Gopinath

On 28/06/2022 12:36, Rajendra Nayak wrote:
> 
> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>
>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>     			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>     		};
>>>>>>>>     
>>>>>>>> +		pmu@1436400 {
>>>>>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>> +			reg = <0 0x01436400 0 0x600>;
>>>>>>>> +
>>>>>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>> +
>>>>>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>> +			interconnect-names = "ddr", "l3c";
>>>>>>>
>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>
>>>>>> To my understanding this is the one between CPU and caches.
>>>>>
>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>
>>>> I double checked now and you're right.
>>>>
>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>> and DDR)
>>>>
>>>> In my case it exposes different issue - under performance. Somehow the
>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>
>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>> --memory-block-size=4M run
>>>>
>>>> 1. Vanilla: 29768 MB/s
>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>> 4. Fixed bwmon 24911 MB/s
>>>> Bwmon does not vote for maximum L3 speed:
>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>> )
>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>
>>>> Maybe that's just problem with missing governor which would vote for
>>>> bandwidth rounding up or anticipating higher needs.
>>>>
>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>
>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>> when seeing traffic between CPU and cache.
>>>>>
>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>
>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>
>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>
>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>> l3c interconnect path.
>>>>
>>>
>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>> subsystem. As such traffic hitting this cache will not show up in either
>>> bwmon instance.
>>>
>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>> hits the memory bus towards DDR.
> 
> That seems right, looking some more into the downstream code and register definitions
> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
> and similar for sdm845 too.
> 
> L3 should perhaps still be voted based on the cpu freq as done today.

This would mean that the original bandwidth values (800 - 7216 MB/s) were
correct. However, we still have your observation that bwmon kicks in very
fast, and my measurements showing that the sampled bwmon data reaches a
bandwidth of ~20000 MB/s.


Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-28 10:50                   ` Krzysztof Kozlowski
@ 2022-06-28 13:15                     ` Rajendra Nayak
  -1 siblings, 0 replies; 52+ messages in thread
From: Rajendra Nayak @ 2022-06-28 13:15 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Thara Gopinath



On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>
>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>
>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>>      			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>      		};
>>>>>>>>>      
>>>>>>>>> +		pmu@1436400 {
>>>>>>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>> +			reg = <0 0x01436400 0 0x600>;
>>>>>>>>> +
>>>>>>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>> +
>>>>>>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>> +			interconnect-names = "ddr", "l3c";
>>>>>>>>
>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>
>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>
>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>
>>>>> I double checked now and you're right.
>>>>>
>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>> and DDR)
>>>>>
>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>
>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>> --memory-block-size=4M run
>>>>>
>>>>> 1. Vanilla: 29768 MB/s
>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>> 4. Fixed bwmon 24911 MB/s
>>>>> Bwmon does not vote for maximum L3 speed:
>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>> )
>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>
>>>>> Maybe that's just problem with missing governor which would vote for
>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>
>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>
>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>> when seeing traffic between CPU and cache.
>>>>>>
>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>
>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>
>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>
>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>> l3c interconnect path.
>>>>>
>>>>
>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>> bwmon instance.
>>>>
>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>> hits the memory bus towards DDR.
>>
>> That seems right, looking some more into the downstream code and register definitions
>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>> and similar for sdm845 too.
>>
>> L3 should perhaps still be voted based on the cpu freq as done today.
> 
> This would mean that original bandwidth values (800 - 7216 MB/s) were
> correct. However we have still your observation that bwmon kicks in very
> fast and my measurements that sampled bwmon data shows bandwidth ~20000
> MB/s.

Right, that's because the bandwidth supported on the cpu<->llcc path is much higher
than what the DDR frequencies provide. For instance, on sc7280 I see 2288 - 15258 MB/s
for LLCC, while the DDR max is 8532 MB/s.
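
So the bandwidth OPP table tied to the bwmon4 instance would then cover the LLCC
range rather than the DDR range; roughly something like this (a sketch only;
intermediate levels are omitted and the kBps values are just the two endpoints
quoted above):

	cpu_llcc_bw_opp_table: opp-table {		/* hypothetical label */
		compatible = "operating-points-v2";

		opp-2288 {
			opp-peak-kBps = <2288000>;	/* lowest LLCC level */
		};

		/* ... intermediate levels omitted ... */

		opp-15258 {
			opp-peak-kBps = <15258000>;	/* highest LLCC level */
		};
	};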

> 
> 
> Best regards,
> Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-28 13:15                     ` Rajendra Nayak
@ 2022-06-28 14:02                       ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-28 14:02 UTC (permalink / raw)
  To: Rajendra Nayak, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Thara Gopinath

On 28/06/2022 15:15, Rajendra Nayak wrote:
> 
> 
> On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
>> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>>
>>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>>
>>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>>>      			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>      		};
>>>>>>>>>>      
>>>>>>>>>> +		pmu@1436400 {
>>>>>>>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>>> +			reg = <0 0x01436400 0 0x600>;
>>>>>>>>>> +
>>>>>>>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>> +
>>>>>>>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>>> +			interconnect-names = "ddr", "l3c";
>>>>>>>>>
>>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>>
>>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>>
>>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>>
>>>>>> I double checked now and you're right.
>>>>>>
>>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>>> and DDR)
>>>>>>
>>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>>
>>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>>> --memory-block-size=4M run
>>>>>>
>>>>>> 1. Vanilla: 29768 MB/s
>>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>>> 4. Fixed bwmon 24911 MB/s
>>>>>> Bwmon does not vote for maximum L3 speed:
>>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>>> )
>>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>>
>>>>>> Maybe that's just problem with missing governor which would vote for
>>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>>
>>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>>
>>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>>> when seeing traffic between CPU and cache.
>>>>>>>
>>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>>
>>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>>
>>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>>
>>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>>> l3c interconnect path.
>>>>>>
>>>>>
>>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>>> bwmon instance.
>>>>>
>>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>>> hits the memory bus towards DDR.
>>>
>>> That seems right, looking some more into the downstream code and register definitions
>>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)

For sdm845 SLAVE_LLCC is in mem_noc, so I guess mc_virt on sc7280?

>>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>>> and similar for sdm845 too.
>>>
>>> L3 should perhaps still be voted based on the cpu freq as done today.
>>
>> This would mean that original bandwidth values (800 - 7216 MB/s) were
>> correct. However we have still your observation that bwmon kicks in very
>> fast and my measurements that sampled bwmon data shows bandwidth ~20000
>> MB/s.
> 
> Right, thats because the bandwidth supported between the cpu<->llcc path is much higher
> than the DDR frequencies. For instance on sc7280, I see (2288 - 15258 MB/s) for LLCC while
> the DDR max is 8532 MB/s.

OK, that sounds right.

Another point is that I did not find any actual throughput scaling via
that interconnect path:
<&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>

so I cannot test the impact of bwmon that way.
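
Just to make the proposed split concrete, a rough and untested sketch of
how the sdm845 node might then look (assuming SLAVE_LLCC is indeed
provided by mem_noc, as the driver suggests; interconnect-names and the
bandwidth OPP table are omitted here):

	pmu@1436400 {
		compatible = "qcom,sdm845-cpu-bwmon";
		reg = <0 0x01436400 0 0x600>;
		interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
		/* bwmon4 counters would vote only on the CPU <-> LLCC path */
		interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>;
	};

A future bwmon5 instance would then presumably vote on the
MASTER_LLCC -> SLAVE_EBI1 path instead, mirroring what you described for
sc7280.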

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-28 14:02                       ` Krzysztof Kozlowski
@ 2022-06-28 15:20                         ` Rajendra Nayak
  -1 siblings, 0 replies; 52+ messages in thread
From: Rajendra Nayak @ 2022-06-28 15:20 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Thara Gopinath



On 6/28/2022 7:32 PM, Krzysztof Kozlowski wrote:
> On 28/06/2022 15:15, Rajendra Nayak wrote:
>>
>>
>> On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
>>> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>>>
>>>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>>>
>>>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>>>>       			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>>       		};
>>>>>>>>>>>       
>>>>>>>>>>> +		pmu@1436400 {
>>>>>>>>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>>>> +			reg = <0 0x01436400 0 0x600>;
>>>>>>>>>>> +
>>>>>>>>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>> +
>>>>>>>>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>>>> +			interconnect-names = "ddr", "l3c";
>>>>>>>>>>
>>>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>>>
>>>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>>>
>>>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>>>
>>>>>>> I double checked now and you're right.
>>>>>>>
>>>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>>>> and DDR)
>>>>>>>
>>>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>>>
>>>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>>>> --memory-block-size=4M run
>>>>>>>
>>>>>>> 1. Vanilla: 29768 MB/s
>>>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>>>> 4. Fixed bwmon 24911 MB/s
>>>>>>> Bwmon does not vote for maximum L3 speed:
>>>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>>>> )
>>>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>>>
>>>>>>> Maybe that's just problem with missing governor which would vote for
>>>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>>>
>>>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>>>
>>>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>>>> when seeing traffic between CPU and cache.
>>>>>>>>
>>>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>>>
>>>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>>>
>>>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>>>
>>>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>>>> l3c interconnect path.
>>>>>>>
>>>>>>
>>>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>>>> bwmon instance.
>>>>>>
>>>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>>>> hits the memory bus towards DDR.
>>>>
>>>> That seems right, looking some more into the downstream code and register definitions
>>>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>>>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>>>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>>>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>>>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>>>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
> 
> For sdm845 SLAVE_LLCC is in mem_noc, so I guess mc_virt on sc7280?

That's correct.

> 
>>>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>>>> and similar for sdm845 too.
>>>>
>>>> L3 should perhaps still be voted based on the cpu freq as done today.
>>>
>>> This would mean that original bandwidth values (800 - 7216 MB/s) were
>>> correct. However we have still your observation that bwmon kicks in very
>>> fast and my measurements that sampled bwmon data shows bandwidth ~20000
>>> MB/s.
>>
>> Right, thats because the bandwidth supported between the cpu<->llcc path is much higher
>> than the DDR frequencies. For instance on sc7280, I see (2288 - 15258 MB/s) for LLCC while
>> the DDR max is 8532 MB/s.
> 
> OK, that sounds right.
> 
> Another point is that I did not find actual scaling of throughput via
> that interconnect path:
> <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>

Shouldn't this be <&gladiator_noc MASTER_APPSS_PROC 3 &gladiator_noc SLAVE_LLCC 3> on sdm845?

> 
> so I cannot test impact of bwmon that way.
> 
> Best regards,
> Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON
  2022-06-28 15:20                         ` Rajendra Nayak
@ 2022-06-28 15:23                           ` Krzysztof Kozlowski
  -1 siblings, 0 replies; 52+ messages in thread
From: Krzysztof Kozlowski @ 2022-06-28 15:23 UTC (permalink / raw)
  To: Rajendra Nayak, Bjorn Andersson
  Cc: Andy Gross, Georgi Djakov, Rob Herring, Catalin Marinas,
	Will Deacon, linux-arm-msm, linux-pm, devicetree, linux-kernel,
	linux-arm-kernel, Thara Gopinath

On 28/06/2022 17:20, Rajendra Nayak wrote:
> 
> 
> On 6/28/2022 7:32 PM, Krzysztof Kozlowski wrote:
>> On 28/06/2022 15:15, Rajendra Nayak wrote:
>>>
>>>
>>> On 6/28/2022 4:20 PM, Krzysztof Kozlowski wrote:
>>>> On 28/06/2022 12:36, Rajendra Nayak wrote:
>>>>>
>>>>> On 6/27/2022 6:09 PM, Krzysztof Kozlowski wrote:
>>>>>> On 26/06/2022 05:28, Bjorn Andersson wrote:
>>>>>>> On Thu 23 Jun 07:58 CDT 2022, Krzysztof Kozlowski wrote:
>>>>>>>
>>>>>>>> On 23/06/2022 08:48, Rajendra Nayak wrote:
>>>>>>>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>>> index 83e8b63f0910..adffb9c70566 100644
>>>>>>>>>>>> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>>> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
>>>>>>>>>>>> @@ -2026,6 +2026,60 @@ llcc: system-cache-controller@1100000 {
>>>>>>>>>>>>       			interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>>>       		};
>>>>>>>>>>>>       
>>>>>>>>>>>> +		pmu@1436400 {
>>>>>>>>>>>> +			compatible = "qcom,sdm845-cpu-bwmon";
>>>>>>>>>>>> +			reg = <0 0x01436400 0 0x600>;
>>>>>>>>>>>> +
>>>>>>>>>>>> +			interrupts = <GIC_SPI 581 IRQ_TYPE_LEVEL_HIGH>;
>>>>>>>>>>>> +
>>>>>>>>>>>> +			interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>,
>>>>>>>>>>>> +					<&osm_l3 MASTER_OSM_L3_APPS &osm_l3 SLAVE_OSM_L3>;
>>>>>>>>>>>> +			interconnect-names = "ddr", "l3c";
>>>>>>>>>>>
>>>>>>>>>>> Is this the pmu/bwmon instance between the cpu and caches or the one between the caches and DDR?
>>>>>>>>>>
>>>>>>>>>> To my understanding this is the one between CPU and caches.
>>>>>>>>>
>>>>>>>>> Ok, but then because the OPP table lists the DDR bw first and Cache bw second, isn't the driver
>>>>>>>>> ending up comparing the bw values thrown by the pmu against the DDR bw instead of the Cache BW?
>>>>>>>>
>>>>>>>> I double checked now and you're right.
>>>>>>>>
>>>>>>>>> Atleast with my testing on sc7280 I found this to mess things up and I always was ending up at
>>>>>>>>> higher OPPs even while the system was completely idle. Comparing the values against the Cache bw
>>>>>>>>> fixed it.(sc7280 also has a bwmon4 instance between the cpu and caches and a bwmon5 between the cache
>>>>>>>>> and DDR)
>>>>>>>>
>>>>>>>> In my case it exposes different issue - under performance. Somehow the
>>>>>>>> bwmon does not report bandwidth high enough to vote for high bandwidth.
>>>>>>>>
>>>>>>>> After removing the DDR interconnect and bandwidth OPP values I have for:
>>>>>>>> sysbench --threads=8 --time=60 --memory-total-size=20T --test=memory
>>>>>>>> --memory-block-size=4M run
>>>>>>>>
>>>>>>>> 1. Vanilla: 29768 MB/s
>>>>>>>> 2. Vanilla without CPU votes: 8728 MB/s
>>>>>>>> 3. Previous bwmon (voting too high): 32007 MB/s
>>>>>>>> 4. Fixed bwmon 24911 MB/s
>>>>>>>> Bwmon does not vote for maximum L3 speed:
>>>>>>>> bwmon report 9408 MB/s (thresholds set: <9216000 15052801>
>>>>>>>> )
>>>>>>>> osm l3 aggregate 14355 MBps -> 897 MHz, level 7, bw 14355 MBps
>>>>>>>>
>>>>>>>> Maybe that's just problem with missing governor which would vote for
>>>>>>>> bandwidth rounding up or anticipating higher needs.
>>>>>>>>
>>>>>>>>>>> Depending on which one it is, shouldn;t we just be scaling either one and not both the interconnect paths?
>>>>>>>>>>
>>>>>>>>>> The interconnects are the same as ones used for CPU nodes, therefore if
>>>>>>>>>> we want to scale both when scaling CPU, then we also want to scale both
>>>>>>>>>> when seeing traffic between CPU and cache.
>>>>>>>>>
>>>>>>>>> Well, they were both associated with the CPU node because with no other input to decide on _when_
>>>>>>>>> to scale the caches and DDR, we just put a mapping table which simply mapped a CPU freq to a L3 _and_
>>>>>>>>> DDR freq. So with just one input (CPU freq) we decided on what should be both the L3 freq and DDR freq.
>>>>>>>>>
>>>>>>>>> Now with 2 pmu's, we have 2 inputs, so we can individually scale the L3 based on the cache PMU
>>>>>>>>> counters and DDR based on the DDR PMU counters, no?
>>>>>>>>>
>>>>>>>>> Since you said you have plans to add the other pmu support as well (bwmon5 between the cache and DDR)
>>>>>>>>> how else would you have the OPP table associated with that pmu instance? Would you again have both the
>>>>>>>>> L3 and DDR scale based on the inputs from that bwmon too?
>>>>>>>>
>>>>>>>> Good point, thanks for sharing. I think you're right. I'll keep only the
>>>>>>>> l3c interconnect path.
>>>>>>>>
>>>>>>>
>>>>>>> If I understand correctly, <&osm_l3 MASTER_OSM_L3_APPS &osm_l3
>>>>>>> SLAVE_OSM_L3> relates to the L3 cache speed, which sits inside the CPU
>>>>>>> subsystem. As such traffic hitting this cache will not show up in either
>>>>>>> bwmon instance.
>>>>>>>
>>>>>>> The path <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_EBI1 3>
>>>>>>> affects the DDR frequency. So the traffic measured by the cpu-bwmon
>>>>>>> would be the CPU subsystems traffic that missed the L1/L2/L3 caches and
>>>>>>> hits the memory bus towards DDR.
>>>>>
>>>>> That seems right, looking some more into the downstream code and register definitions
>>>>> I see the 2 bwmon instances actually lie on the path outside CPU SS towards DDR,
>>>>> first one (bwmon4) is between the CPUSS and LLCC (system cache) and the second one
>>>>> (bwmon5) between LLCC and DDR. So we should use the counters from bwmon4 to
>>>>> scale the CPU-LLCC path (and not L3), on sc7280 that would mean splitting the
>>>>> <&gem_noc MASTER_APPSS_PROC 3 &mc_virt SLAVE_EBI1 3> into
>>>>> <&gem_noc MASTER_APPSS_PROC 3 &gem_noc SLAVE_LLCC 3> (voting based on the bwmon4 inputs)
>>
>> For sdm845 SLAVE_LLCC is in mem_noc, so I guess mc_virt on sc7280?
> 
> thats correct,
> 
>>
>>>>> and <&mc_virt MASTER_LLCC 3 &mc_virt SLAVE_EBI1 3> (voting based on the bwmon5 inputs)
>>>>> and similar for sdm845 too.
>>>>>
>>>>> L3 should perhaps still be voted based on the cpu freq as done today.
>>>>
>>>> This would mean that original bandwidth values (800 - 7216 MB/s) were
>>>> correct. However we have still your observation that bwmon kicks in very
>>>> fast and my measurements that sampled bwmon data shows bandwidth ~20000
>>>> MB/s.
>>>
>>> Right, thats because the bandwidth supported between the cpu<->llcc path is much higher
>>> than the DDR frequencies. For instance on sc7280, I see (2288 - 15258 MB/s) for LLCC while
>>> the DDR max is 8532 MB/s.
>>
>> OK, that sounds right.
>>
>> Another point is that I did not find actual scaling of throughput via
>> that interconnect path:
>> <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>
> 
> Shouldn't this be <&gladiator_noc MASTER_APPSS_PROC 3 &gladiator_noc SLAVE_LLCC 3> on sdm845?

When I tried this, I got an icc xlate error. If I read the code correctly,
it's in mem_noc:
https://elixir.bootlin.com/linux/v5.19-rc4/source/drivers/interconnect/qcom/sdm845.c#L349
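
For reference, the two variants side by side (the first is what failed
for me, the second resolves but is the path where I could not observe
actual throughput scaling):

	/* fails icc xlate on sdm845: SLAVE_LLCC is not a gladiator_noc endpoint */
	interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &gladiator_noc SLAVE_LLCC 3>;

	/* resolves: SLAVE_LLCC is registered under the mem_noc provider */
	interconnects = <&gladiator_noc MASTER_APPSS_PROC 3 &mem_noc SLAVE_LLCC 3>;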

Best regards,
Krzysztof

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2022-06-28 15:26 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-01 10:11 [PATCH v4 0/4] soc/arm64: qcom: Add initial version of bwmon Krzysztof Kozlowski
2022-06-01 10:11 ` Krzysztof Kozlowski
2022-06-01 10:11 ` [PATCH v4 1/4] dt-bindings: interconnect: qcom,sdm845-cpu-bwmon: add BWMON device Krzysztof Kozlowski
2022-06-01 10:11   ` Krzysztof Kozlowski
2022-06-06 21:11   ` Bjorn Andersson
2022-06-06 21:11     ` Bjorn Andersson
2022-06-07  6:50     ` Krzysztof Kozlowski
2022-06-07  6:50       ` Krzysztof Kozlowski
2022-06-22 11:58       ` Rajendra Nayak
2022-06-22 11:58         ` Rajendra Nayak
2022-06-22 12:20         ` Krzysztof Kozlowski
2022-06-22 12:20           ` Krzysztof Kozlowski
2022-06-26  3:19         ` Bjorn Andersson
2022-06-26  3:19           ` Bjorn Andersson
2022-06-28 10:43           ` Rajendra Nayak
2022-06-28 10:43             ` Rajendra Nayak
2022-06-01 10:11 ` [PATCH v4 2/4] soc: qcom: icc-bwmon: Add bandwidth monitoring driver Krzysztof Kozlowski
2022-06-01 10:11   ` Krzysztof Kozlowski
2022-06-06 16:35   ` Georgi Djakov
2022-06-06 16:35     ` Georgi Djakov
2022-06-01 10:11 ` [PATCH v4 3/4] arm64: defconfig: enable Qualcomm Bandwidth Monitor Krzysztof Kozlowski
2022-06-01 10:11   ` Krzysztof Kozlowski
2022-06-01 10:11 ` [PATCH v4 4/4] arm64: dts: qcom: sdm845: Add CPU BWMON Krzysztof Kozlowski
2022-06-01 10:11   ` Krzysztof Kozlowski
2022-06-06 20:39   ` Georgi Djakov
2022-06-06 20:39     ` Georgi Djakov
2022-06-07  6:48     ` Krzysztof Kozlowski
2022-06-07  6:48       ` Krzysztof Kozlowski
2022-06-22 11:46   ` Rajendra Nayak
2022-06-22 11:46     ` Rajendra Nayak
2022-06-22 13:52     ` Krzysztof Kozlowski
2022-06-22 13:52       ` Krzysztof Kozlowski
2022-06-23  6:48       ` Rajendra Nayak
2022-06-23  6:48         ` Rajendra Nayak
2022-06-23 12:58         ` Krzysztof Kozlowski
2022-06-23 12:58           ` Krzysztof Kozlowski
2022-06-26  3:28           ` Bjorn Andersson
2022-06-26  3:28             ` Bjorn Andersson
2022-06-27 12:39             ` Krzysztof Kozlowski
2022-06-27 12:39               ` Krzysztof Kozlowski
2022-06-28 10:36               ` Rajendra Nayak
2022-06-28 10:36                 ` Rajendra Nayak
2022-06-28 10:50                 ` Krzysztof Kozlowski
2022-06-28 10:50                   ` Krzysztof Kozlowski
2022-06-28 13:15                   ` Rajendra Nayak
2022-06-28 13:15                     ` Rajendra Nayak
2022-06-28 14:02                     ` Krzysztof Kozlowski
2022-06-28 14:02                       ` Krzysztof Kozlowski
2022-06-28 15:20                       ` Rajendra Nayak
2022-06-28 15:20                         ` Rajendra Nayak
2022-06-28 15:23                         ` Krzysztof Kozlowski
2022-06-28 15:23                           ` Krzysztof Kozlowski
