Re: [RFC PATCH v2 18/19] iommu/arm-smmu-v3: Reduce contention during command-queue insertion

From: Will Deacon <will@kernel.org>
To: John Garry <john.garry@huawei.com>
Cc: Vijay Kilary <vkilari@codeaurora.org>,
	Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
	Jon Masters <jcm@redhat.com>, Jan Glauber <jglauber@marvell.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	iommu@lists.linux-foundation.org,
	Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
	Robin Murphy <robin.murphy@arm.com>
Subject: Re: [RFC PATCH v2 18/19] iommu/arm-smmu-v3: Reduce contention during command-queue insertion
Date: Wed, 24 Jul 2019 15:33:56 +0100	[thread overview]
Message-ID: <20190724143355.r2zw6z37igwav2ki@willie-the-truck> (raw)
In-Reply-To: <b6302fdf-29ef-0aa2-ae7a-ed21c506c6ec@huawei.com>

Hi again, John,

On Wed, Jul 24, 2019 at 09:20:49AM +0100, John Garry wrote:
> On 11/07/2019 18:19, Will Deacon wrote:
> > +static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> > +				       u64 *cmds, int n, bool sync)
> > +{
> > +	u64 cmd_sync[CMDQ_ENT_DWORDS];
> > +	u32 prod;
> >  	unsigned long flags;
> > -	bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
> > -	struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
> > -	int ret;
> > +	bool owner;
> > +	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
> > +	struct arm_smmu_ll_queue llq = {
> > +		.max_n_shift = cmdq->q.llq.max_n_shift,
> > +	}, head = llq;
> > +	int ret = 0;
> > 
> > -	arm_smmu_cmdq_build_cmd(cmd, &ent);
> > +	/* 1. Allocate some space in the queue */
> > +	local_irq_save(flags);
> > +	llq.val = READ_ONCE(cmdq->q.llq.val);
> > +	do {
> > +		u64 old;
> > +
> > +		while (!queue_has_space(&llq, n + sync)) {
> > +			local_irq_restore(flags);
> > +			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
> > +				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> > +			local_irq_save(flags);
> > +		}
> > +
> > +		head.cons = llq.cons;
> > +		head.prod = queue_inc_prod_n(&llq, n + sync) |
> > +					     CMDQ_PROD_OWNED_FLAG;
> > +
> > +		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
> 
> I added some basic debug to the stress test on your branch, and this cmpxchg
> was failing ~10 times on average on my D06.
> 
> So we're not using the spinlock now, but this cmpxchg may lack fairness.

It definitely lacks fairness, although that's going to be the case in many
other places where locking is elided as well. If it shows up as an issue, we
should try to address it, but queueing means serialisation and that largely
defeats the point of this patch. I also don't expect it to be a problem
unless the cmpxchg() is heavily contended, which shouldn't be the case if
we're batching.

> Since we're batching commands, I wonder if it's better to restore the
> spinlock and send batched commands + CMD_SYNC under the lock, and then wait
> for the CMD_SYNC completion outside the lock.

Again, we'd need some numbers, but my concern with that approach is that
we're serialising CPUs which is what I've been trying hard to avoid. It
also doesn't let you straightforwardly remove the cmpxchg() loop, since
the owner clears the OWNED flag and you probably wouldn't want to hold
the lock for that.

> I don't know if it improves the queue contetion, but at least the prod
> pointer would be more closely track the issued commands, such that we're not
> waiting to kick off many gathered batches of commands, while the SMMU HW may
> be idle (in terms of command processing).

Again, probably going to need some numbers to change this, although it
sounds like your other suggestion about having the owner move prod twice
would largely address your concerns. Reintroducing the lock, on the other
hand, feels like a big step backwards to me, and the whole reason I started
down the current route was because of vague claims that the locking was a
problem for large systems.

Thanks,

Will
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu