From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932241AbcBIKu5 (ORCPT <rfc822;w@1wt.eu>);
	Tue, 9 Feb 2016 05:50:57 -0500
Received: from smtp-out6.electric.net ([192.162.217.191]:57952 "EHLO
	smtp-out6.electric.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750781AbcBIKuz convert rfc822-to-8bit (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 9 Feb 2016 05:50:55 -0500
From: David Laight <David.Laight@ACULAB.COM>
To: "'George Spelvin'" <linux@horizon.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
        "tom@herbertland.com" <tom@herbertland.com>
CC: "mingo@kernel.org" <mingo@kernel.org>
Subject: RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
Thread-Topic: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
Thread-Index: AQHRYqy27Wbj7StuQUCi5Ndt7TrMLp8jhE3Q
Date: Tue, 9 Feb 2016 10:48:14 +0000
Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1CCDBCA5@AcuExch.aculab.com>
References: <20160208201234.8569.qmail@ns.horizon.com>
In-Reply-To: <20160208201234.8569.qmail@ns.horizon.com>
Accept-Language: en-GB, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.202.99.200]
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: 8BIT
MIME-Version: 1.0
X-Outbound-IP: 213.249.233.130
X-Env-From: David.Laight@ACULAB.COM
X-PolicySMART: 3396946, 3397078
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

From: George Spelvin [mailto:linux@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant gain.
> > You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from L1 cache).
> 
> If the carry chain is a bottleneck, on Broadwell+ (feature flag
> X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
> separate flag bits for their carry chains and so can be interleaved.
> 
> I don't have such a machine to test on, but if someone who does
> would like to do a little benchmarking, that would be an interesting
> data point.
> 
> Unfortunately, that means yet another version of the main loop,
> but if there's a significant benefit...

Well, the only part actually worth writing in assembler is the 'adc' loop.
So run-time substitution of separate versions (as is done for memcpy())
wouldn't be hard.
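
A minimal sketch of that kind of substitution, using a plain function pointer
chosen once at start-up rather than the kernel's own patching machinery. All
names here (csum_loop_fn, csum_loop_generic, csum_loop_adx_placeholder,
csum_select_loop, the cpu_has_adx argument) are hypothetical, and the "ADX"
variant is only a placeholder that calls the generic loop:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for the inner summing loop over 64-bit words. */
typedef uint64_t (*csum_loop_fn)(const uint64_t *p, size_t n);

/* Generic loop: add each word with end-around carry. */
static uint64_t csum_loop_generic(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        sum += (sum < p[i]);    /* wrapped: feed the carry back in */
    }
    return sum;
}

/* Placeholder for an ADCX/ADOX version (assumption: not written here). */
static uint64_t csum_loop_adx_placeholder(const uint64_t *p, size_t n)
{
    return csum_loop_generic(p, n);
}

/* The loop actually used; defaults to the generic version. */
static csum_loop_fn csum_loop = csum_loop_generic;

/* Pick the loop once, from a feature flag the caller has already probed. */
static void csum_select_loop(int cpu_has_adx)
{
    csum_loop = cpu_has_adx ? csum_loop_adx_placeholder : csum_loop_generic;
}
```

Only the inner loop is swapped; alignment handling and the final fold to
16 bits would stay in common C code.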

Since adcx and adox must be able to execute in parallel, I clearly need to
relearn how dependencies on the flags register work. I'm sure I remember
issues with 'false dependencies' on the flags.
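
For reference, the idea of two interleaved carry chains can be modelled in
portable C as two accumulators that never depend on each other inside the
loop. This is only a stand-in for the separate flag bits that adcx and adox
use, not an ADX implementation, and the function names are made up for the
sketch:

```c
#include <stddef.h>
#include <stdint.h>

/* Add b into a with end-around carry (ones' complement addition). */
static uint64_t add_carry_wrap(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    /* If the add wrapped, s <= 2^64 - 2, so the +1 cannot overflow. */
    return s + (s < b);
}

/* Single chain: every add depends on the previous one. */
static uint64_t sum_one_chain(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = add_carry_wrap(sum, p[i]);
    return sum;
}

/* Two chains: even and odd words go to separate accumulators. */
static uint64_t sum_two_chains(const uint64_t *p, size_t n)
{
    uint64_t even = 0, odd = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        even = add_carry_wrap(even, p[i]);
        odd = add_carry_wrap(odd, p[i + 1]);
    }
    if (i < n)                  /* odd word count: one word left over */
        even = add_carry_wrap(even, p[i]);

    return add_carry_wrap(even, odd);   /* merge the chains at the end */
}
```

Both functions give the same value modulo 2^64 - 1, which is all the
checksum needs; whether the split is actually faster has to be measured.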

However, you still need a loop construct that doesn't modify the overflow ('o')
or carry ('c') flags. Using leal, jcxz (jrcxz with a 64-bit count), jmp might
work, since none of them writes the flags.
(Unless Broadwell actually has a fast 'loop' instruction.)
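
An untested-for-speed sketch of such a loop for the plain adc case, as GCC
inline asm. It assumes x86-64 with a GNU-compatible compiler and uses the
64-bit forms (leaq, jrcxz); the function name is made up, and other targets
fall back to plain C:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n 64-bit words with end-around carry. */
static uint64_t sum_words_flag_safe(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ __volatile__(
        "clc\n\t"                     /* start with CF clear */
        "1:\n\t"
        "jrcxz 2f\n\t"                /* leave when count is 0; no flags written */
        "adcq (%[p]), %[sum]\n\t"     /* sum += *p + CF */
        "leaq 8(%[p]), %[p]\n\t"      /* advance pointer; lea leaves flags alone */
        "leaq -1(%%rcx), %%rcx\n\t"   /* decrement count; lea leaves flags alone */
        "jmp 1b\n\t"
        "2:\n\t"
        "adcq $0, %[sum]\n\t"         /* fold in the last carry */
        "adcq $0, %[sum]\n\t"         /* and the carry that fold can produce */
        : [sum] "+r"(sum), [p] "+r"(p), "+c"(n)
        :
        : "cc", "memory");
#else
    /* Fallback for other targets: same result, carry handled in C. */
    for (size_t i = 0; i < n; i++) {
        sum += p[i];
        sum += (sum < p[i]);
    }
#endif
    return sum;
}
```

The carry from each adcq survives to the next iteration only because nothing
between them writes CF; an add, sub, cmp or dec on the counter would break
the chain (dec preserves CF but writes the other flags, including OF).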

(I've not got a suitable test CPU.)

	David