From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751497AbcBJOni (ORCPT <rfc822;w@1wt.eu>);
	Wed, 10 Feb 2016 09:43:38 -0500
Received: from ns.horizon.com ([71.41.210.147]:21782 "HELO ns.horizon.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP
	id S1750897AbcBJOng (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 10 Feb 2016 09:43:36 -0500
Date: 10 Feb 2016 09:43:34 -0500
Message-ID: <20160210144334.23242.qmail@ns.horizon.com>
From: "George Spelvin" <linux@horizon.com>
To: David.Laight@ACULAB.COM, linux-kernel@vger.kernel.org, linux@horizon.com,
        netdev@vger.kernel.org, tom@herbertland.com
Subject: RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
Cc: mingo@kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1CCDCC8D@AcuExch.aculab.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

David Laight wrote:
> Separate renaming allows:
> 1) The value to tested without waiting for pending updates to complete.
>    Useful for IE and DIR.

I don't quite follow.  It allows the value to be tested without waiting
for pending updates *of other bits* to complete.

Obviously, the update of the bit being tested has to complete!

> I can't see any obvious gain from separating out O or Z (even with
> adcx and adox). You'd need some other instructions that don't set O (or Z)
> but set some other useful flags.
> (A decrement that only set Z for instance.)

I tried to describe the advantages in the previous message.

The problems arise much less often than with the INC/DEC pair, but there
are instructions which write only the O and C flags (ROL, ROR) and only
the Z flag (CMPXCHG8B/CMPXCHG16B; plain CMPXCHG writes all six).

The sign, aux carry, and parity flags are *always* updated as
a group, so they can be renamed as a group.
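
To make the grouping argument concrete, here is a minimal sketch in C.
It is only a model, not how any particular CPU implements renaming: the
flag masks, the WR_* tables and the partial_writes() helper are names
made up for illustration, and the per-instruction flag sets are the ones
described in this thread.  It counts how many rename groups an
instruction writes only partially, which is the case that forces a
merge with the previous flags value.

```c
#include <stddef.h>

/* Arithmetic status flags as bit masks (illustrative names). */
enum {
    FLAG_C = 1 << 0,
    FLAG_O = 1 << 1,
    FLAG_Z = 1 << 2,
    FLAG_S = 1 << 3,
    FLAG_A = 1 << 4,
    FLAG_P = 1 << 5,
    FLAG_ALL = 0x3f
};

/* Flags written by a few instruction classes, as discussed above. */
enum {
    WR_ADD     = FLAG_ALL,            /* ADD/SUB/CMP: all six flags  */
    WR_INC_DEC = FLAG_ALL & ~FLAG_C,  /* INC/DEC: everything but C   */
    WR_ROL_ROR = FLAG_C | FLAG_O,     /* rotates: C and O only       */
    WR_ADCX    = FLAG_C,              /* ADCX: C only                */
    WR_ADOX    = FLAG_O,              /* ADOX: O only                */
    WR_Z_ONLY  = FLAG_Z               /* an instruction writing Z only */
};

/*
 * Count the rename groups that `written` touches without covering
 * completely.  Each such group needs the old value merged in, so the
 * instruction depends on the previous writer of that group.
 */
int partial_writes(unsigned written, const unsigned *groups, size_t ngroups)
{
    int count = 0;

    for (size_t i = 0; i < ngroups; i++) {
        unsigned hit = written & groups[i];

        if (hit != 0 && hit != groups[i])
            count++;
    }
    return count;
}
```

With the two-way split {C}, {O,S,Z,A,P}, INC/DEC come out clean but
ROL/ROR, ADOX and a Z-only writer each report one partial write.  With
the four-way split {C}, {O}, {Z}, {S,A,P}, every entry in the table
reports zero.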

> While LOOP could be used on Bulldozer+ an equivalently fast loop
> can be done with inc/dec and jnz.
> So you only care about LOOP/JCXZ when ADOX is supported.
> 
> I think the fastest loop is:
> 10:	adc	%rax,0(%rdi,%rcx,8)
> 	inc	%rcx
> 	jnz	10b
> but check if any cpu add an extra clock for the 'scaled' offset
> (they might be faster if %rdi is incremented).
> That loop looks like it will have no overhead on recent cpu.

Well, it should execute at 1 iteration/cycle, limited by the carry
dependency from each ADC to the next.  (No, a scaled offset doesn't take
extra time.)  To beat that requires two independent carry chains, i.e.
ADCX/ADOX:

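10:	adcxq	0(%rdi,%rcx),%rax	# carry chain 1 (CF only)
	adoxq	8(%rdi,%rcx),%rdx	# carry chain 2 (OF only)
	leaq	16(%rcx),%rcx		# advance index, flags untouched
	jrcxz	11f			# exit when index reaches zero
	jmp	10b
11: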
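
For anyone who wants to convince themselves that splitting the sum
across two carry chains is safe, here is a minimal C model.  It is a
sketch, not the kernel's csum_partial: add_eac(), csum_one_chain() and
csum_two_chains() are hypothetical names, and the model folds each carry
back in immediately instead of deferring it in CF/OF the way the
assembly does.  The result is the same either way, because
end-around-carry addition is commutative and associative.

```c
#include <stddef.h>
#include <stdint.h>

/* Add with end-around carry: a carry out of bit 63 is added back in. */
uint64_t add_eac(uint64_t sum, uint64_t x)
{
    sum += x;
    return sum + (sum < x);  /* (sum < x) is 1 exactly when the add wrapped */
}

/* One accumulator: every add depends on the previous one. */
uint64_t csum_one_chain(const uint64_t *p, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum = add_eac(sum, p[i]);
    return sum;
}

/*
 * Two accumulators, as with ADCX/ADOX: even words go to one chain and
 * odd words to the other, so the two dependency chains can overlap.
 * The chains are combined once at the end.
 */
uint64_t csum_two_chains(const uint64_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        a = add_eac(a, p[i]);
        b = add_eac(b, p[i + 1]);
    }
    if (i < n)               /* odd word count: one leftover word */
        a = add_eac(a, p[i]);
    return add_eac(a, b);
}
```

Both functions return the same 64-bit value for any input, so the
two-chain version can be folded down to 16 bits exactly like the
one-chain version.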