Date: Thu, 28 Jan 2016 16:28:48 +0100
From: Peter Zijlstra
To: Borislav Petkov
Cc: Huang Rui, Borislav Petkov, Ingo Molnar, Andy Lutomirski,
	Thomas Gleixner, Robert Richter, Jacob Shin, John Stultz,
	Frédéric Weisbecker, linux-kernel@vger.kernel.org,
	spg_linux_kernel@amd.com, x86@kernel.org, Guenter Roeck,
	Andreas Herrmann, Suravee Suthikulpanit, Aravind Gopalakrishnan,
	Fengguang Wu, Aaron Lu
Subject: Re: [PATCH v4] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
Message-ID: <20160128152848.GT6356@twins.programming.kicks-ass.net>
References: <1453963131-2013-1-git-send-email-ray.huang@amd.com>
 <20160128090314.GB14274@pd.tnic>
In-Reply-To: <20160128090314.GB14274@pd.tnic>
User-Agent: Mutt/1.5.21 (2012-12-30)

On Thu, Jan 28, 2016 at 10:03:15AM +0100, Borislav Petkov wrote:
> +
> +struct power_pmu {
> +	raw_spinlock_t lock;

Now that the list is gone, what does this thing protect?

> +	struct pmu *pmu;

This member seems superfluous, there's only the one possible value.

> +	local64_t cpu_sw_pwr_ptsc;
> +
> +	/*
> +	 * These two cpumasks are used for avoiding the allocations on the
> +	 * CPU_STARTING phase because power_cpu_prepare() will be called with
> +	 * IRQs disabled.
> +	 */
> +	cpumask_var_t mask;
> +	cpumask_var_t tmp_mask;
> +};
> +
> +static struct pmu pmu_class;
> +
> +/*
> + * Accumulated power represents the sum of each compute unit's (CU) power
> + * consumption. On any core of each CU we read the total accumulated power from
> + * MSR_F15H_CU_PWR_ACCUMULATOR. cpu_mask represents CPU bit map of all cores
> + * which are picked to measure the power for the CUs they belong to.
> + */
> +static cpumask_t cpu_mask;
> +
> +static DEFINE_PER_CPU(struct power_pmu *, amd_power_pmu);
> +
> +static u64 event_update(struct perf_event *event, struct power_pmu *pmu)
> +{

Is there ever a case where @pmu != __this_cpu_read(power_pmu) ?

> +	struct hw_perf_event *hwc = &event->hw;
> +	u64 prev_raw_count, new_raw_count, prev_ptsc, new_ptsc;
> +	u64 delta, tdelta;
> +
> +again:
> +	prev_raw_count = local64_read(&hwc->prev_count);
> +	prev_ptsc = local64_read(&pmu->cpu_sw_pwr_ptsc);
> +	rdmsrl(event->hw.event_base, new_raw_count);

Is hw.event_base != MSR_F15H_CU_PWR_ACCUMULATOR possible?

> +	rdmsrl(MSR_F15H_PTSC, new_ptsc);

Also, I suspect this doesn't do what you expect it to do. We measure
per-event PWR_ACC deltas, but per CPU PTSC values. These do not match
when there's more than 1 event on the CPU.

I would suggest adding a new struct to the hw_perf_event union with the
two u64 deltas like:

	struct { /* amd_power */
		u64	pwr_acc;
		u64	ptsc;
	};

And track these values per-event.

> +
> +	if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> +			    new_raw_count) != prev_raw_count) {
> +		cpu_relax();
> +		goto again;
> +	}
> +
> +	/*
> +	 * Calculate the CU power consumption over a time period, the unit of
> +	 * final value (delta) is micro-Watts. Then add it to the event count.
> +	 */
> +	if (new_raw_count < prev_raw_count) {
> +		delta = max_cu_acc_power + new_raw_count;
> +		delta -= prev_raw_count;
> +	} else
> +		delta = new_raw_count - prev_raw_count;
> +
> +	delta *= cpu_pwr_sample_ratio * 1000;
> +	tdelta = new_ptsc - prev_ptsc;
> +
> +	do_div(delta, tdelta);
> +	local64_add(delta, &event->count);

Then this division can be redone on the total values, which loses less
precision overall.

> +
> +	return new_raw_count;
> +}