All of lore.kernel.org
 help / color / mirror / Atom feed
* ptrlist-iterator performance on one wine source file
@ 2017-07-27 15:05 Christopher Li
  2017-07-29 13:01 ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-07-27 15:05 UTC (permalink / raw)
  To: Luc Van Oostenryck, Linux-Sparse, Dibyendu Majumdar

Hi Luc,

Per our discussion, here is some test you are interested.
It show some measurable difference in ptrlist-iterator vs the
base before the patch series.

Step to reproduce:

git clone git://source.winehq.org/git/wine.git
cd win/dlls/usp10/tests

The test command:

time sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
-D__WINESRC__ -D_REENTRANT -fPIC -Wall -pipe -fno-strict-aliasing
-Wdeclaration-after-statement -Wempty-body -Wignored-qualifiers
-Wshift-overflow=2 -Wstrict-prototypes -Wtype-limits
-Wunused-but-set-parameter -Wvla -Wwrite-strings -Wpointer-arith
-Wlogical-op -gdwarf-2 -gstrict-dwarf -g -O2

I run the test on ptrlist-iterator from the git url you send out.
Here is 3 runs each on ptrlist-iterator and the base line commit.

with ptrlist-iterator d55cac721a060882df02f40aab8824d210d9dc6f

real 0m25.764s
user 0m25.430s
sys 0m0.227s

real 0m25.746s
user 0m25.424s
sys 0m0.219s

real 0m25.800s
user 0m25.326s
sys 0m0.201s

with base(rc4) ec3f72e981792a86a9e002471a06d61ecd5c6675

real 0m23.625s
user 0m23.311s
sys 0m0.190s

real 0m23.489s
user 0m23.002s
sys 0m0.195s

real 0m23.321s
user 0m23.043s
sys 0m0.185s


Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-27 15:05 ptrlist-iterator performance on one wine source file Christopher Li
@ 2017-07-29 13:01 ` Luc Van Oostenryck
  2017-07-29 15:53   ` Christopher Li
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-07-29 13:01 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Thu, Jul 27, 2017 at 11:05:38AM -0400, Christopher Li wrote:
> Hi Luc,
> 
> Per our discussion, here is some test you are interested.
> It show some measurable difference in ptrlist-iterator vs the
> base before the patch series.

Interesting.
My measurements showed much smaller differences (around 1-2%
and smaller than the stdev). But I also saw that this difference
was significantly bigger on older/less performant machines
(which is not very surprising).

Anyway, this series in itself won't solve the issues we
currently have with deletion in nested ptrlist walking loops.

Thanks for the numbers.
-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 13:01 ` Luc Van Oostenryck
@ 2017-07-29 15:53   ` Christopher Li
  2017-07-29 16:04     ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-07-29 15:53 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 9:01 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> Interesting.
> My measurements showed much smaller differences (around 1-2%
> and smaller than the stdev). But I also saw that this difference
> was significantly bigger on older/less performant machines
> (which is not very surprising).

I guess my laptop can qualify as older/less performance machines :-)

A wild guess is that the modern CPU are better doing caching and
branch predictions etcs. The iterator is doing a lot of memory store
and load from the iterator struct, and a lot of call to the interator advance
function.

The reason it show up more on this C file is that, sparse is doing a lot
of recursive find domanitor store etc. There are very deep level of nested
ptrlist loop here. The performance penalty on the inner loop will amplify
by the level of nested loop.

BTW, I am writing some code to construct the dominator tree for
basic blocks. I think it will be useful to avoid repeat finding dominator.
It will also help you finding the dominator frontier for the phi locations.

Another possible use is that, we can do incremental remove of dead
blocks without visiting all the blocks. We can just find out what other
blocks dominated by this dead block, mark them for delete and repeat.

Some materials for the next release.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 15:53   ` Christopher Li
@ 2017-07-29 16:04     ` Luc Van Oostenryck
  2017-07-29 16:25       ` Christopher Li
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-07-29 16:04 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 5:53 PM, Christopher Li <sparse@chrisli.org> wrote:
> On Sat, Jul 29, 2017 at 9:01 AM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>>
>> Interesting.
>> My measurements showed much smaller differences (around 1-2%
>> and smaller than the stdev). But I also saw that this difference
>> was significantly bigger on older/less performant machines
>> (which is not very surprising).
>
> I guess my laptop can qualify as older/less performance machines :-)

Hehe, maybe :)

> A wild guess is that the modern CPU are better doing caching and
> branch predictions etcs.

Indeed.

> BTW, I am writing some code to construct the dominator tree for
> basic blocks. I think it will be useful to avoid repeat finding dominator.
> It will also help you finding the dominator frontier for the phi locations.

I don't need it but it's indeed something that is generally needed.

> Another possible use is that, we can do incremental remove of dead
> blocks without visiting all the blocks.

You can't do that once cycles are involved. You need something
like the marking algorithm used by kill_unreachable_bbs() for that.
And it was such a  cycle that created the problem with the false
"crazy programmer" warning.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 16:04     ` Luc Van Oostenryck
@ 2017-07-29 16:25       ` Christopher Li
  2017-07-29 16:30         ` Christopher Li
  2017-07-29 16:35         ` Luc Van Oostenryck
  0 siblings, 2 replies; 33+ messages in thread
From: Christopher Li @ 2017-07-29 16:25 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 12:04 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> You can't do that once cycles are involved. You need something
> like the marking algorithm used by kill_unreachable_bbs() for that.
> And it was such a  cycle that created the problem with the false
> "crazy programmer" warning.

No I don't think so. The find dominator already taking the cycles into
account. By definition if X dominate Y, means every execution flow
from entry point to Y will need to go through X. If X was not reachable,
nor does Y. It does not change where the block get deleted. It just don't
not need to do the marking algorithm. That is the point of dominator tree.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 16:25       ` Christopher Li
@ 2017-07-29 16:30         ` Christopher Li
  2017-07-29 16:35         ` Luc Van Oostenryck
  1 sibling, 0 replies; 33+ messages in thread
From: Christopher Li @ 2017-07-29 16:30 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 12:25 PM, Christopher Li <sparse@chrisli.org> wrote:
> No I don't think so. The find dominator already taking the cycles into
> account. By definition if X dominate Y, means every execution flow
> from entry point to Y will need to go through X. If X was not reachable,
> nor does Y. It does not change where the block get deleted. It just don't
> not need to do the marking algorithm. That is the point of dominator tree.

Hold on I think I make a mistake there. You are right. It only work if
we already know
that block is dead. If there is cycles that block has incoming edges. Let's
me think about it. There should be better way to detect that is back edge.

Thinking.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 16:25       ` Christopher Li
  2017-07-29 16:30         ` Christopher Li
@ 2017-07-29 16:35         ` Luc Van Oostenryck
  2017-07-29 19:33           ` Christopher Li
  1 sibling, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-07-29 16:35 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 6:25 PM, Christopher Li <sparse@chrisli.org> wrote:
> On Sat, Jul 29, 2017 at 12:04 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>>
>> You can't do that once cycles are involved. You need something
>> like the marking algorithm used by kill_unreachable_bbs() for that.
>> And it was such a  cycle that created the problem with the false
>> "crazy programmer" warning.
>
> No I don't think so. The find dominator already taking the cycles into
> account. By definition if X dominate Y, means every execution flow
> from entry point to Y will need to go through X. If X was not reachable,
> nor does Y. It does not change where the block get deleted. It just don't
> not need to do the marking algorithm. That is the point of dominator tree.

OK, I misread and misunderstood that you was talking about  the
dominator *tree*.

The real problem with such a tree is that you need to maintain it as
it potentially changes each time there is a change in the CFG.
And of course, building this tree is not linear (in the number of BBs)
while finding the dead BBs is linear.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 16:35         ` Luc Van Oostenryck
@ 2017-07-29 19:33           ` Christopher Li
  2017-07-29 21:47             ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-07-29 19:33 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 12:35 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>>> You can't do that once cycles are involved. You need something
>>> like the marking algorithm used by kill_unreachable_bbs() for that.
>>> And it was such a  cycle that created the problem with the false
>>> "crazy programmer" warning.
>>
>> No I don't think so. The find dominator already taking the cycles into
>> account. By definition if X dominate Y, means every execution flow
>> from entry point to Y will need to go through X. If X was not reachable,
>> nor does Y. It does not change where the block get deleted. It just don't
>> not need to do the marking algorithm. That is the point of dominator tree.
>
> OK, I misread and misunderstood that you was talking about  the
> dominator *tree*.

I was confused after your misunderstood as well. Well for a few minutes.
The dominator tree does not have back edges. So if the edge on CFG
we are removing is matching the edge on dominator tree. In other words,
we are removing an edge from the dominator tree,  the whole branch
of that tree can be removed as dead code.

> The real problem with such a tree is that you need to maintain it as
> it potentially changes each time there is a change in the CFG.

Right. So far most of the case is removing edges and blocks. That should
be relative trivial to update the dominator tree. I haven't try it so we will
see.

Do we have a case actually *add* edge or block to the CFG?
I don't recall one.

> And of course, building this tree is not linear (in the number of BBs)
> while finding the dead BBs is linear.
Find one round of dead BBs is linear, N number of BBs. Right now we
mark and sweep the CFG for every node removed, total of M dead nodes.
Then the total is M*N.

In the memops finding dominating store is doing a lot worse. That is
why gcc complete that file almost instantly. Sparse takes 30 seconds
on my machine. One big problem is it did not cache the dominating
result. It is redoing the finding again and again.

The algorithm I am using is this one:

Cooper, Keith D.; Harvey, Timothy J.; and Kennedy, Ken (2001).
A Simple, Fast Dominance Algorithm
https://www.cs.rice.edu/~keith/EMBED/dom.pdf

It is O(N^2), in practice it is faster than  Lengauer and Tarjan
which is O(E *log(N)).

Most of the sparse source code I try converge within 3 rounds.
For practical purpose it feels like linear.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 19:33           ` Christopher Li
@ 2017-07-29 21:47             ` Luc Van Oostenryck
  2017-07-30  4:15               ` Christopher Li
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-07-29 21:47 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 9:33 PM, Christopher Li <sparse@chrisli.org> wrote:
> On Sat, Jul 29, 2017 at 12:35 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>>>> You can't do that once cycles are involved. You need something
>>>> like the marking algorithm used by kill_unreachable_bbs() for that.
>>>> And it was such a  cycle that created the problem with the false
>>>> "crazy programmer" warning.
>>>
>>> No I don't think so. The find dominator already taking the cycles into
>>> account. By definition if X dominate Y, means every execution flow
>>> from entry point to Y will need to go through X. If X was not reachable,
>>> nor does Y. It does not change where the block get deleted. It just don't
>>> not need to do the marking algorithm. That is the point of dominator tree.
>>
>> OK, I misread and misunderstood that you was talking about  the
>> dominator *tree*.
>
> I was confused after your misunderstood as well. Well for a few minutes.
> The dominator tree does not have back edges. So if the edge on CFG
> we are removing is matching the edge on dominator tree. In other words,
> we are removing an edge from the dominator tree,  the whole branch
> of that tree can be removed as dead code.

Mmmm, interesting.

>> The real problem with such a tree is that you need to maintain it as
>> it potentially changes each time there is a change in the CFG.
>
> Right. So far most of the case is removing edges and blocks. That should
> be relative trivial to update the dominator tree. I haven't try it so we will
> see.
>
> Do we have a case actually *add* edge or block to the CFG?
> I don't recall one.

We do, more or less.
Once the code is linearized and inlining is done we:
- never add new BBs
- remove some BBs (some removing edges from the CFG)
- do several kinds of branch simplification (that's moving
  edge, so technically it's adding edge to the CFG, not sure it
  change the dom tree though).

>> And of course, building this tree is not linear (in the number of BBs)
>> while finding the dead BBs is linear.
> Find one round of dead BBs is linear, N number of BBs. Right now we
> mark and sweep the CFG for every node removed, total of M dead nodes.
> Then the total is M*N.

Yes, but this calculation is not correct at all.
- each time a node is removed, the total number of nodes is smaller
  and so the next time is a bit faster (this would correspond to a factor
  of 2)
- much more importantly, each time kill_unreachable_bbs() is called
  *all* the currently dead BBs are removed at once. So single call
  can kill several BBs. Of course, it will be different for each CFG/input
  files.

> In the memops finding dominating store is doing a lot worse. That is
> why gcc complete that file almost instantly. Sparse takes 30 seconds
> on my machine. One big problem is it did not cache the dominating
> result. It is redoing the finding again and again.

Uh?
Which input file your talking about?

BTW, two months ago or so, I posted a sort of mini-report about
quadratic behaviour with the current code. This quadratic behaviour
doesn't exist with the series I have for the misplaced phi-nodes.

> The algorithm I am using is this one:
>
> Cooper, Keith D.; Harvey, Timothy J.; and Kennedy, Ken (2001).
> A Simple, Fast Dominance Algorithm
> https://www.cs.rice.edu/~keith/EMBED/dom.pdf
>
> It is O(N^2), in practice it is faster than  Lengauer and Tarjan
> which is O(E *log(N)).
>
> Most of the sparse source code I try converge within 3 rounds.
> For practical purpose it feels like linear.

*smile* Feels like linear?
Did you try with several input files, some with big functions?

> Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-29 21:47             ` Luc Van Oostenryck
@ 2017-07-30  4:15               ` Christopher Li
  2017-07-30 15:12                 ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-07-30  4:15 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sat, Jul 29, 2017 at 5:47 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> We do, more or less.
> Once the code is linearized and inlining is done we:
> - never add new BBs
> - remove some BBs (some removing edges from the CFG)
> - do several kinds of branch simplification (that's moving
>   edge, so technically it's adding edge to the CFG, not sure it
>   change the dom tree though).
>
That is merging nodes right? Two nodes combine as one.
If we consider the two nodes and one sub graph. The combine
sub graph does not reach to new node. In other words, it does
not extend its reachability. My guess is that should be fine.

Moving block to another place is another story.
>
> Yes, but this calculation is not correct at all.
> - each time a node is removed, the total number of nodes is smaller
>   and so the next time is a bit faster (this would correspond to a factor
>   of 2)

if N >> M then it does not matter much.


> - much more importantly, each time kill_unreachable_bbs() is called
>   *all* the currently dead BBs are removed at once. So single call
>   can kill several BBs. Of course, it will be different for each CFG/input
>   files.

Yes, that would be linear to the number of blocks removed. It still
need to go through the blocks to clean up the instructions usage etc.

>> In the memops finding dominating store is doing a lot worse. That is
>> why gcc complete that file almost instantly. Sparse takes 30 seconds
>> on my machine. One big problem is it did not cache the dominating
>> result. It is redoing the finding again and again.



>
> Uh?
> Which input file your talking about?

This ptrlist testing wine source file that takes  23 second for sparse to run.
I take a brief look at it, it is doing a lot of dominating search.

>
> *smile* Feels like linear?
> Did you try with several input files, some with big functions?
I just try some sparse source file. The largest one is in parse.c.

The paper has more detail report on using huge number of nodes.
Tested on real code and random generated CFG for really large
number of nodes. I am not going repeat it here.

BTW, I just find out LLVM was using the exact same algorithm at
some point. Not sue what they are using now. They might not build
the whole tree any more.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-30  4:15               ` Christopher Li
@ 2017-07-30 15:12                 ` Luc Van Oostenryck
  2017-07-30 15:49                   ` Christopher Li
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-07-30 15:12 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sun, Jul 30, 2017 at 12:15:40AM -0400, Christopher Li wrote:
> On Sat, Jul 29, 2017 at 5:47 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
> >
> > We do, more or less.
> > Once the code is linearized and inlining is done we:
> > - never add new BBs
> > - remove some BBs (some removing edges from the CFG)
> > - do several kinds of branch simplification (that's moving
> >   edge, so technically it's adding edge to the CFG, not sure it
> >   change the dom tree though).
> >
> That is merging nodes right? Two nodes combine as one.

I was more thinking to things like done in try_to_simplify_bb().

> Moving block to another place is another story.
> >
> > Yes, but this calculation is not correct at all.
> > - each time a node is removed, the total number of nodes is smaller
> >   and so the next time is a bit faster (this would correspond to a factor
> >   of 2)
> 
> if N >> M then it does not matter much.

Indeed. 

> > - much more importantly, each time kill_unreachable_bbs() is called
> >   *all* the currently dead BBs are removed at once. So single call
> >   can kill several BBs. Of course, it will be different for each CFG/input
> >   files.
> 
> Yes, that would be linear to the number of blocks removed. It still
> need to go through the blocks to clean up the instructions usage etc.

Yes, but that's totally independent of how and how often the detection
of dead code is done. At the end, every dead instructions will need
to be cleaned up (the minimum is removing pseudo usage).

> >> In the memops finding dominating store is doing a lot worse. That is
> >> why gcc complete that file almost instantly. Sparse takes 30 seconds
> >> on my machine. One big problem is it did not cache the dominating
> >> result. It is redoing the finding again and again.
> 
> > Uh?
> > Which input file your talking about?
> 
> This ptrlist testing wine source file that takes  23 second for sparse to run.
> I take a brief look at it, it is doing a lot of dominating search.

Is it possible to have a pathname or a link?

> > *smile* Feels like linear?
> > Did you try with several input files, some with big functions?
> I just try some sparse source file. The largest one is in parse.c.

It's not the size of the file that matter here, it's the size
(and complexity) of the function(s).

> The paper has more detail report on using huge number of nodes.
> Tested on real code and random generated CFG for really large
> number of nodes. I am not going repeat it here.

No, I know about this paper.

> BTW, I just find out LLVM was using the exact same algorithm at
> some point. Not sue what they are using now. They might not build
> the whole tree any more.

It's possible, indeed.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-30 15:12                 ` Luc Van Oostenryck
@ 2017-07-30 15:49                   ` Christopher Li
  2017-07-30 16:16                     ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-07-30 15:49 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sun, Jul 30, 2017 at 11:12 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> On Sun, Jul 30, 2017 at 12:15:40AM -0400, Christopher Li wrote:
>> On Sat, Jul 29, 2017 at 5:47 PM, Luc Van Oostenryck
>> <luc.vanoostenryck@gmail.com> wrote:
>> >
>> > We do, more or less.
>> > Once the code is linearized and inlining is done we:
>> > - never add new BBs
>> > - remove some BBs (some removing edges from the CFG)
>> > - do several kinds of branch simplification (that's moving
>> >   edge, so technically it's adding edge to the CFG, not sure it
>> >   change the dom tree though).
>> >
>> That is merging nodes right? Two nodes combine as one.
>
> I was more thinking to things like done in try_to_simplify_bb().
>
>> Moving block to another place is another story.
>> >
>> > Yes, but this calculation is not correct at all.
>> > - each time a node is removed, the total number of nodes is smaller
>> >   and so the next time is a bit faster (this would correspond to a factor
>> >   of 2)
>>
>> if N >> M then it does not matter much.
>
> Indeed.
>
>> > - much more importantly, each time kill_unreachable_bbs() is called
>> >   *all* the currently dead BBs are removed at once. So single call
>> >   can kill several BBs. Of course, it will be different for each CFG/input
>> >   files.
>>
>> Yes, that would be linear to the number of blocks removed. It still
>> need to go through the blocks to clean up the instructions usage etc.
>
> Yes, but that's totally independent of how and how often the detection
> of dead code is done. At the end, every dead instructions will need
> to be cleaned up (the minimum is removing pseudo usage).
>
>> >> In the memops finding dominating store is doing a lot worse. That is
>> >> why gcc complete that file almost instantly. Sparse takes 30 seconds
>> >> on my machine. One big problem is it did not cache the dominating
>> >> result. It is redoing the finding again and again.
>>
>> > Uh?
>> > Which input file your talking about?
>>
>> This ptrlist testing wine source file that takes  23 second for sparse to run.
>> I take a brief look at it, it is doing a lot of dominating search.
>
> Is it possible to have a pathname or a link?

It is the very first email I send out:


git clone git://source.winehq.org/git/wine.git
cd win/dlls/usp10/tests

The test command:

time sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
-D__WINESRC__ -D_REENTRANT -fPIC -Wall -pipe -fno-strict-aliasing
-Wdeclaration-after-statement -Wempty-body -Wignored-qualifiers
-Wshift-overflow=2 -Wstrict-prototypes -Wtype-limits
-Wunused-but-set-parameter -Wvla -Wwrite-strings -Wpointer-arith
-Wlogical-op -gdwarf-2 -gstrict-dwarf -g -O2

I think gcc compile this file very fast but sparse spend a lot of time
on it. My impression it is spending time repeat finding dominating
stores. I did not look at it very deep, but I know sparse did not cache
any dominating information. It do fresh search every time.

> It's not the size of the file that matter here, it's the size
> (and complexity) of the function(s).

Yes, mean the complexity of the functions. How many blocks.
My impression parse.c has the largest one I saw so far. I have't
done it very scientifically. Other file all have relatively small functions.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-30 15:49                   ` Christopher Li
@ 2017-07-30 16:16                     ` Luc Van Oostenryck
  2017-08-01 20:33                       ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-07-30 16:16 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sun, Jul 30, 2017 at 5:49 PM, Christopher Li <sparse@chrisli.org> wrote:

>>> >> In the memops finding dominating store is doing a lot worse. That is
>>> >> why gcc complete that file almost instantly. Sparse takes 30 seconds
>>> >> on my machine. One big problem is it did not cache the dominating
>>> >> result. It is redoing the finding again and again.
>>>
>>> > Uh?
>>> > Which input file your talking about?
>>>
>>> This ptrlist testing wine source file that takes  23 second for sparse to run.
>>> I take a brief look at it, it is doing a lot of dominating search.
>>
>> Is it possible to have a pathname or a link?
>
> It is the very first email I send out:
>
>
> git clone git://source.winehq.org/git/wine.git
> cd win/dlls/usp10/tests
>
> The test command:
>
> time sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
> -D__WINESRC__ -D_REENTRANT -fPIC -Wall -pipe -fno-strict-aliasing
> -Wdeclaration-after-statement -Wempty-body -Wignored-qualifiers
> -Wshift-overflow=2 -Wstrict-prototypes -Wtype-limits
> -Wunused-but-set-parameter -Wvla -Wwrite-strings -Wpointer-arith
> -Wlogical-op -gdwarf-2 -gstrict-dwarf -g -O2

OK, thanks. I'll take a look once the infinite loop problem will be closed.

> I think gcc compile this file very fast but sparse spend a lot of time on it.

Interesting.
On most input sparse is much faster than gcc, often by a factor or 10
or even more. But of course, if there is a problem with sparse ...

> My impression it is spending time repeat finding dominating stores.
Possible indeed.
> I did not look at it very deep, but I know sparse did not cache
> any dominating information. It do fresh search every time.

Yes. It's even done for each instruction to CSE (but most of the
time it's not much costly, still looking after a few parents though).

>> It's not the size of the file that matter here, it's the size
>> (and complexity) of the function(s).
>
> Yes, mean the complexity of the functions. How many blocks.
> My impression parse.c has the largest one I saw so far. I have't
> done it very scientifically. Other file all have relatively small functions.

I really don't think sparse has any function large enough to worry after
non-linearity (you would need at least a few hundred BBs).
IMO, the thing to look at/worry about is the constant factor.

> Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-07-30 16:16                     ` Luc Van Oostenryck
@ 2017-08-01 20:33                       ` Luc Van Oostenryck
  2017-08-01 21:09                         ` Christopher Li
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-01 20:33 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Sun, Jul 30, 2017 at 06:16:03PM +0200, Luc Van Oostenryck wrote:
> On Sun, Jul 30, 2017 at 5:49 PM, Christopher Li <sparse@chrisli.org> wrote:
> 
> >>> >> In the memops finding dominating store is doing a lot worse. That is
> >>> >> why gcc complete that file almost instantly. Sparse takes 30 seconds
> >>> >> on my machine. One big problem is it did not cache the dominating
> >>> >> result. It is redoing the finding again and again.
> >>>
> >>> > Uh?
> >>> > Which input file your talking about?
> >>>
> >>> This ptrlist testing wine source file that takes  23 second for sparse to run.
> >>> I take a brief look at it, it is doing a lot of dominating search.
> >>
> >> Is it possible to have a pathname or a link?
> >
> > It is the very first email I send out:
> >
> >
> > git clone git://source.winehq.org/git/wine.git
> > cd win/dlls/usp10/tests
> >
> > The test command:
> >
> > time sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
> > -D__WINESRC__ -D_REENTRANT -fPIC -Wall -pipe -fno-strict-aliasing
> > -Wdeclaration-after-statement -Wempty-body -Wignored-qualifiers
> > -Wshift-overflow=2 -Wstrict-prototypes -Wtype-limits
> > -Wunused-but-set-parameter -Wvla -Wwrite-strings -Wpointer-arith
> > -Wlogical-op -gdwarf-2 -gstrict-dwarf -g -O2
> 
> OK, thanks. I'll take a look once the infinite loop problem will be closed.
> 
> > I think gcc compile this file very fast but sparse spend a lot of time on it.
> 
> Interesting.
> On most input sparse is much faster than gcc, often by a factor or 10
> or even more. But of course, if there is a problem with sparse ...
> 
> > My impression it is spending time repeat finding dominating stores.
> Possible indeed.

When I try the command you gave with -rc4 sparse I get:
	real	0m3.281s
	user	0m3.175s
	sys	0m0.097s
wich is very far from the 23-30s you got. Dunno what the difference could be.

For comparison, with the new 7 patches the time is now:
	real	0m2.146s
	user	0m1.928s
	sys	0m0.214s
wich is significatively faster.

And with gcc I get:
	real	0m2.153s
	user	0m2.080s
	sys	0m0.078s

Also, the result of the preprocessing is a 13K lines, 788K bytes file.

I would still expect that sparse is (much) faster than gcc, but
with the number I have here, I'm not sure I can say something is wrong.
But we should first find why you got 23-30s while I got 3.3.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-01 20:33                       ` Luc Van Oostenryck
@ 2017-08-01 21:09                         ` Christopher Li
  2017-08-01 21:46                           ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-08-01 21:09 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

[-- Attachment #1: Type: text/plain, Size: 1265 bytes --]

On Tue, Aug 1, 2017 at 4:33 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> When I try the command you gave with -rc4 sparse I get:
>         real    0m3.281s
>         user    0m3.175s
>         sys     0m0.097s
> wich is very far from the 23-30s you got. Dunno what the difference could be.

Did you run it with all the parameter?

If I just run the sort version:

time sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
-D__WINESRC__ -D_REENTRANT

It is:
real 0m4.532s
user 0m4.436s
sys 0m0.075s

But if I run with all parameter.

sh -x slow.sh

+ sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
-D__WINESRC__ -D_REENTRANT -fPIC -Wall -pipe -fno-strict-aliasing
-Wdeclaration-after-statement -Wempty-body -Wignored-qualifiers
-Wshift-overflow=2 -Wstrict-prototypes -Wtype-limits
-Wunused-but-set-parameter -Wvla -Wwrite-strings -Wpointer-arith
-Wlogical-op -gdwarf-2 -gstrict-dwarf -g -O2
/usr/include/sys/sysmacros.h:79:1: warning: constant
0xfffff00000000000u is so big it is unsigned long
/usr/include/sys/sysmacros.h:80:1: warning: constant
0x00000ffffff00000u is so big it is unsigned long
...

real 0m24.771s
user 0m24.447s
sys 0m0.213s

Again, I haven't dig very deep with this yet. Might be a bug some where indeed.

Chris

[-- Attachment #2: slow.sh --]
[-- Type: application/x-sh, Size: 355 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-01 21:09                         ` Christopher Li
@ 2017-08-01 21:46                           ` Luc Van Oostenryck
  2017-08-01 23:37                             ` Christopher Li
       [not found]                             ` <CANeU7QmzundH7qpdYhQqDJgBv+5pPemwft+1uH5oVQ1POnoQDw@mail.gmail.com>
  0 siblings, 2 replies; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-01 21:46 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Tue, Aug 1, 2017 at 11:09 PM, Christopher Li <sparse@chrisli.org> wrote:
> On Tue, Aug 1, 2017 at 4:33 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>>
>> When I try the command you gave with -rc4 sparse I get:
>>         real    0m3.281s
>>         user    0m3.175s
>>         sys     0m0.097s
>> wich is very far from the 23-30s you got. Dunno what the difference could be.
>
> Did you run it with all the parameter?

I did

> If I just run the sort version:
>
> time sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
> -D__WINESRC__ -D_REENTRANT
>
> It is:
> real 0m4.532s
> user 0m4.436s
> sys 0m0.075s
>
> But if I run with all parameter.
>
> sh -x slow.sh
>
> + sparse -m64 -c -o usp10.o usp10.c -I. -I../../../include
> -D__WINESRC__ -D_REENTRANT -fPIC -Wall -pipe -fno-strict-aliasing
> -Wdeclaration-after-statement -Wempty-body -Wignored-qualifiers
> -Wshift-overflow=2 -Wstrict-prototypes -Wtype-limits
> -Wunused-but-set-parameter -Wvla -Wwrite-strings -Wpointer-arith
> -Wlogical-op -gdwarf-2 -gstrict-dwarf -g -O2
> /usr/include/sys/sysmacros.h:79:1: warning: constant
> 0xfffff00000000000u is so big it is unsigned long
> /usr/include/sys/sysmacros.h:80:1: warning: constant
> 0x00000ffffff00000u is so big it is unsigned long
> ...
>
> real 0m24.771s
> user 0m24.447s
> sys 0m0.213s
>
> Again, I haven't dig very deep with this yet. Might be a bug some where indeed.

Here, both give essentially the same time.
Which is quite normal as none of the 'additional' options
should make any difference in how sparse is processing
things: it's just some more flags for warnings and none of
-fPIC -pipe -fno-strict-aliasing -gdwarf-2 -gstrict-dwarf -g
has any effects for sparse and the only effect of -O2 is
to define __OPTIMIZE__.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-01 21:46                           ` Luc Van Oostenryck
@ 2017-08-01 23:37                             ` Christopher Li
  2017-08-02  0:42                               ` Christopher Li
       [not found]                             ` <CANeU7QmzundH7qpdYhQqDJgBv+5pPemwft+1uH5oVQ1POnoQDw@mail.gmail.com>
  1 sibling, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-08-01 23:37 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Tue, Aug 1, 2017 at 5:46 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> Here, both give essentially the same time.
> Which is quite normal as none of the 'additional' options
> should make any difference in how sparse is processing
> things: it's just some more flags for warnings and none of
> -fPIC -pipe -fno-strict-aliasing -gdwarf-2 -gstrict-dwarf -g
> has any effects for sparse and the only effect of -O2 is
> to define __OPTIMIZE__.

So you can't reproduce it. Let me dig deeper to see what is going on.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-01 23:37                             ` Christopher Li
@ 2017-08-02  0:42                               ` Christopher Li
  0 siblings, 0 replies; 33+ messages in thread
From: Christopher Li @ 2017-08-02  0:42 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Tue, Aug 1, 2017 at 7:37 PM, Christopher Li <sparse@chrisli.org> wrote:
> So you can't reproduce it. Let me dig deeper to see what is going on.

#0  pseudo_in_list (list=0x7f5a6972de10, pseudo=0x7f5a7aa65d68) at
liveness.c:163
#1  0x0000000000427924 in add_pseudo_exclusive (list=0x7f5a7dcf3758,
pseudo=0x7f5a7aa65d68)
    at liveness.c:173
#2  0x0000000000427afe in track_bb_liveness (bb=0x7f5a7dcf3680) at
liveness.c:207
#3  0x0000000000427db9 in track_pseudo_liveness (ep=0x7f5a863d1010) at
liveness.c:250
#4  0x000000000041d313 in linearize_fn (sym=0x7f5a8c5fd8b0,
base_type=0x7f5a8c5fd990)
    at linearize.c:2260
#5  0x000000000041d3b1 in linearize_symbol (sym=0x7f5a8c5fd8b0) at
linearize.c:2291
#6  0x0000000000401a08 in check_symbols (list=0x7f5a8c685910) at sparse.c:282
#7  0x0000000000401afe in main (argc=29, argv=0x7fff1f8d99c8) at sparse.c:303

I take a closer look, it seems sparse is spending a lot of time in
track_pseudo_liveness.
Compile with -O0 does not change any thing. At this point it could be
any thing.
Looking more into it.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
       [not found]                             ` <CANeU7QmzundH7qpdYhQqDJgBv+5pPemwft+1uH5oVQ1POnoQDw@mail.gmail.com>
@ 2017-08-02 22:50                               ` Luc Van Oostenryck
  2017-08-03 21:49                                 ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-02 22:50 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Wed, Aug 2, 2017 at 3:17 AM, Christopher Li <sparse@chrisli.org> wrote:
> On Tue, Aug 1, 2017 at 5:46 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>> has any effects for sparse and the only effect of -O2 is
>> to define __OPTIMIZE__.
>
> Yes, indeed. It is the -O2 make the difference.
>
> I recently upgrade to Fedora 26. I think it is the system header file
> making a difference
> on the __OPTIMIZE__. I attach two files here. O2.c is the one with -O2 flag
> after processor. The 0.c is the one without.
>
> There is huge difference in them.
>
> I confirm with O2.c I am seeing the 24 second delay. And 0.c 4 seconds.
>
> I attach the two files with gzip.

It seems that the email was rejected on the mailing list.

> I think you should be able to reproduce it with O2.c now.

Yes, I can reproduce it now.
The differences in the two file are not big. Basically, in -O2 there is
- a bunch of small functions which have now an inline definition
- a set of strcmp(winetest_platform, "wine") which are replaced
  by a macro of hell.
It's, of course, the last one which creates the problem.
The macro seems to try to optimize the compare using the fact
that the compiler will statically evaluate things like:
- strlen("wine")
- "wine"[0], "wine"[1], ...
but sparse doesn't do this kind of simplification (yet) and this result
in much much more code.

A first inspection of the generated code doesn't show anything
obviously wrong but I don't exclude there is another problem.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-02 22:50                               ` Luc Van Oostenryck
@ 2017-08-03 21:49                                 ` Luc Van Oostenryck
  2017-08-03 22:29                                   ` Luc Van Oostenryck
                                                     ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-03 21:49 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Thu, Aug 03, 2017 at 12:50:20AM +0200, Luc Van Oostenryck wrote:
> On Wed, Aug 2, 2017 at 3:17 AM, Christopher Li <sparse@chrisli.org> wrote:
> > On Tue, Aug 1, 2017 at 5:46 PM, Luc Van Oostenryck
> > <luc.vanoostenryck@gmail.com> wrote:
> >> has any effects for sparse and the only effect of -O2 is
> >> to define __OPTIMIZE__.
> >
> > Yes, indeed. It is the -O2 make the difference.
> >
> > I recently upgrade to Fedora 26. I think it is the system header file
> > making a difference
> > on the __OPTIMIZE__. I attach two files here. O2.c is the one with -O2 flag
> > after processor. The 0.c is the one without.
> >
> > There is huge difference in them.
> >
> > I confirm with O2.c I am seeing the 24 second delay. And 0.c 4 seconds.
> >
> > I attach the two files with gzip.
> 
> It seems that the email was rejected on the mailing list.
> 
> > I think you should be able to reproduce it with O2.c now.
> 
> Yes, I can reproduce it now.
> The differences in the two file are not big. Basically, in -O2 there is
> - a bunch of small functions which have now an inline definition
> - a set of strcmp(winetest_platform, "wine") which are replaced
>   by a macro of hell.
> It's, of course, the last one which creates the problem.
> The macro seems to try to optimize the compare using the fact
> that the compiler will statically evaluate things like:
> - strlen("wine")
> - "wine"[0], "wine"[1], ...
> but sparse doesn't do this kind of simplification (yet) and this result
> in much much more code.
> 
> A first inspection of the generated code doesn't show anything
> obviously wrong but I don't exclude there is another problem.

I looked a bit more at this and the problem is not really because of this
evaluation sparse doesn't do. Also I work on a simplified version of the
files keeping nothing after test_ScriptItemize().

1) some numbers:
- GCC compile both preprocessed files in .9s with -O2.
- sparse check the O0 file in 1.93s and O2 file in 13s.
Thus even on the O0 file, the time is already too high because generaly
sparse is roughly 10 times faster than gcc -O2, here is twice as slow.

sparse emits roughly 5 times more BBs for the O2 file than for the O0 one,
it also emits roughly 5 times more instructions and take roughly 5 more
time to generate them (ok, 7 times more).
Thus the processing time and the number of instructions scale well
with the number of BBs emitted (and these BBs are emitted before any
simplification are made), so there is no sign of any non-linearity
nor any oddities with simplification/optimization that would run crazy.

2) I don't think that the lack of static evaluation of strlen("wine") or
   "wine"[0] is a problem here because where present this code is preceded
   by a test __builtin_constant_p(winetest_platform) wich fail.
   All this code is then quickly optimized away (but have first been
   generated as linearized code).

3) the situation with the macro from hell is even worse than I thought:
   it's present 7 times but is in a inline function which is itself
   used 292 times. Thus this code is present 2044 times!
   Some tests show that the time is (roughly) proportional to the
   number of time this inline function is used.

4) if we replace 'inline' by 'inline __attribute__((always_inline))'
   GCC needs roughly 58s to compile the O0 or O2 file.

My conclusion is that most of the problem here comes from the fact that:
   - sparse always inlines function marked as inline
   - and does so very early, before any optimizations (so the extra macro
     code is inlined 2044 times and need to be processed 2044 times).
GCC needs the same time for both file because it inlines functions after
a first optimization pass (I'm guessing here).

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-03 21:49                                 ` Luc Van Oostenryck
@ 2017-08-03 22:29                                   ` Luc Van Oostenryck
  2017-08-03 22:35                                   ` Linus Torvalds
  2017-08-04 11:33                                   ` ptrlist-iterator performance on one wine source file Luc Van Oostenryck
  2 siblings, 0 replies; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-03 22:29 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Thu, Aug 03, 2017 at 11:49:08PM +0200, Luc Van Oostenryck wrote:
> 2) I don't think that the lack of static evaluation of strlen("wine") or
>    "wine"[0] is a problem here because where present this code is preceded
>    by a test __builtin_constant_p(winetest_platform) wich fail.
>    All this code is then quickly optimized away (but have first been
>    generated as linearized code).

There is a big problem here as __builtin_constant_p() is not expanded.

Investigating.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-03 21:49                                 ` Luc Van Oostenryck
  2017-08-03 22:29                                   ` Luc Van Oostenryck
@ 2017-08-03 22:35                                   ` Linus Torvalds
  2017-08-04  0:04                                     ` Christopher Li
  2017-08-04  0:11                                     ` Luc Van Oostenryck
  2017-08-04 11:33                                   ` ptrlist-iterator performance on one wine source file Luc Van Oostenryck
  2 siblings, 2 replies; 33+ messages in thread
From: Linus Torvalds @ 2017-08-03 22:35 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Christopher Li, Linux-Sparse, Dibyendu Majumdar

On Thu, Aug 3, 2017 at 2:49 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> My conclusion is that most of the problem here comes from the fact that:
>    - sparse always inlines function marked as inline
>    - and does so very early, before any optimizations (so the extra macro
>      code is inlined 2044 times and need to be processed 2044 times).

Yeah, I think that's sadly fairly fundamental to how sparse inlining works.

Sparse inlines things on a tree level, but most optimizations and
simplifications are done on the SSA representation.

That very early inlining was actually a fairly big design decision
originally - at some point I actually wanted to allow sparse to have
inline functions act as untyped "templates" that it inlined, and that
had their types evaluated only within the context of being used.

I have this memory of that actually even working to some degree at
some point (ie you could leave arguments to inline functions untyped,
and they would take their type from the invocation). But that may have
been with special patches.

But yes, for this particular case it's apparently a horrible choice,
exactly because it inlines very early before any actual evaluation of
anything.

                  Linus

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-03 22:35                                   ` Linus Torvalds
@ 2017-08-04  0:04                                     ` Christopher Li
  2017-08-04  0:11                                     ` Luc Van Oostenryck
  1 sibling, 0 replies; 33+ messages in thread
From: Christopher Li @ 2017-08-04  0:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Luc Van Oostenryck, Linux-Sparse, Dibyendu Majumdar

On Thu, Aug 3, 2017 at 6:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> That very early inlining was actually a fairly big design decision
> originally - at some point I actually wanted to allow sparse to have

What is the reason behind it?

> inline functions act as untyped "templates" that it inlined, and that
> had their types evaluated only within the context of being used.

Sound a lot like macro then. Can it do any thing macro can not do?

> I have this memory of that actually even working to some degree at
> some point (ie you could leave arguments to inline functions untyped,
> and they would take their type from the invocation). But that may have
> been with special patches.
>
> But yes, for this particular case it's apparently a horrible choice,
> exactly because it inlines very early before any actual evaluation of
> anything.

Sounds like reason to implement inline at the instruction level.
I think it should be easier to copy instructions than AST.
I recall the inline copying of AST is very error prone, e.g. forget to
copy a member of the AST struct.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-03 22:35                                   ` Linus Torvalds
  2017-08-04  0:04                                     ` Christopher Li
@ 2017-08-04  0:11                                     ` Luc Van Oostenryck
  2017-08-04  0:16                                       ` [PATCH] fix: give a type to bad conditionnal expressions Luc Van Oostenryck
  1 sibling, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-04  0:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Christopher Li, Linux-Sparse, Dibyendu Majumdar

On Thu, Aug 03, 2017 at 03:35:04PM -0700, Linus Torvalds wrote:
> On Thu, Aug 3, 2017 at 2:49 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
> >
> > My conclusion is that most of the problem here comes from the fact that:
> >    - sparse always inlines function marked as inline
> >    - and does so very early, before any optimizations (so the extra macro
> >      code is inlined 2044 times and need to be processed 2044 times).
> 
> Yeah, I think that's sadly fairly fundamental to how sparse inlining works.
> 
> Sparse inlines things on a tree level, but most optimizations and
> simplifications are done on the SSA representation.
> 
> That very early inlining was actually a fairly big design decision
> originally - at some point I actually wanted to allow sparse to have
> inline functions act as untyped "templates" that it inlined, and that
> had their types evaluated only within the context of being used.

Yes, it makes sense.
 
> I have this memory of that actually even working to some degree at
> some point (ie you could leave arguments to inline functions untyped,
> and they would take their type from the invocation). But that may have
> been with special patches.
> 
> But yes, for this particular case it's apparently a horrible choice,
> exactly because it inlines very early before any actual evaluation of
> anything.

Well, if we'll want one day to have some sort of automatic inlining,
like for small functions or functions that are only called once,
we'll need to have inlining on linearized code. One more thing in
the todo list. Otherwise, for 'normal' sane usage, the current
inliner work well enough.

OTOH, I just saw that the heavy inlining is only a marginal part
of the problem here. The real problem is that __builtin_constant_p()
is not expanded because of a type mismatch in a conditional:
one operand is 0 while the other is the inline function which return void.
The type mismatch makes that the expression has no type so no expansion
is done.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH] fix: give a type to bad conditionnal expressions
  2017-08-04  0:11                                     ` Luc Van Oostenryck
@ 2017-08-04  0:16                                       ` Luc Van Oostenryck
  2017-08-04 12:31                                         ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-04  0:16 UTC (permalink / raw)
  To: linux-sparse; +Cc: Christopher Li, Linus Torvalds, Luc Van Oostenryck

Bad conditional expressions used to have no type (NULL)
this in turn makes that some further processing are
not done. In particular, here, the expansion of the
operands are not done.

Fix this by giving to such expression a type 'bad_type'.

Note: nor gcc, not clang seems to emit a warning for the
      the testcase here which is not conform to the standard.
      OTOH, sparse complains and this was the cause of the
      non-expansion of the builtin.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
---
 evaluate.c                   |  1 +
 validation/cond-err-expand.c | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)
 create mode 100644 validation/cond-err-expand.c

diff --git a/evaluate.c b/evaluate.c
index cf3cf244d..5b4abdb6a 100644
--- a/evaluate.c
+++ b/evaluate.c
@@ -1220,6 +1220,7 @@ static struct symbol *evaluate_conditional_expression(struct expression *expr)
 
 Err:
 	expression_error(expr, "incompatible types in conditional expression (%s)", typediff);
+	expr->ctype = &bad_ctype;
 	return NULL;
 
 out:
diff --git a/validation/cond-err-expand.c b/validation/cond-err-expand.c
new file mode 100644
index 000000000..72af8d4b1
--- /dev/null
+++ b/validation/cond-err-expand.c
@@ -0,0 +1,27 @@
+static inline void f(void)
+{
+	__builtin_constant_p(0);
+}
+
+int foo(int a)
+{
+	return 0 ? 0 : f();
+}
+
+int bar(int a)
+{
+	return 1 ? f() : 0;
+}
+
+/*
+ * check-name: cond-err-expand.c
+ * check-command: test-linearize -Wno-decl $file
+ *
+ * check-error-start
+cond-err-expand.c:8:18: error: incompatible types in conditional expression (different base types)
+cond-err-expand.c:13:18: error: incompatible types in conditional expression (different base types)
+ * check-error-end
+ *
+ * check-output-ignore
+ * check-excludes: call.* __builtin_constant_p
+ */
-- 
2.13.2


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-03 21:49                                 ` Luc Van Oostenryck
  2017-08-03 22:29                                   ` Luc Van Oostenryck
  2017-08-03 22:35                                   ` Linus Torvalds
@ 2017-08-04 11:33                                   ` Luc Van Oostenryck
  2017-08-04 14:51                                     ` Christopher Li
  2 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-04 11:33 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linux-Sparse, Dibyendu Majumdar

On Thu, Aug 03, 2017 at 11:49:08PM +0200, Luc Van Oostenryck wrote:
> 1) some numbers:
> - GCC compile both preprocessed files in .9s with -O2.
> - sparse check the O0 file in 1.93s and O2 file in 13s.
> Thus even on the O0 file, the time is already too high because generaly
> sparse is roughly 10 times faster than gcc -O2, here is twice as slow.
> ...
> 4) if we replace 'inline' by 'inline __attribute__((always_inline))'
>    GCC needs roughly 58s to compile the O0 or O2 file.

With the patch I sent, sparse now need 2.1s to compile the O2 file.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] fix: give a type to bad conditionnal expressions
  2017-08-04  0:16                                       ` [PATCH] fix: give a type to bad conditionnal expressions Luc Van Oostenryck
@ 2017-08-04 12:31                                         ` Luc Van Oostenryck
  2017-08-04 14:52                                           ` Christopher Li
  2017-08-04 14:53                                           ` Christopher Li
  0 siblings, 2 replies; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-04 12:31 UTC (permalink / raw)
  To: Linux-Sparse; +Cc: Christopher Li, Luc Van Oostenryck

Please ignore this patch as I must have forgot to recompile the last change.
The problem is there and the solution too but bad_ctype won't solve
the issue here.

I'll send another version which will effectively solve the issue but
I'm far from sure
we'll want it.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-04 11:33                                   ` ptrlist-iterator performance on one wine source file Luc Van Oostenryck
@ 2017-08-04 14:51                                     ` Christopher Li
  2017-08-04 22:26                                       ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-08-04 14:51 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse, Dibyendu Majumdar

On Fri, Aug 4, 2017 at 7:33 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> On Thu, Aug 03, 2017 at 11:49:08PM +0200, Luc Van Oostenryck wrote:
>> 1) some numbers:
>> - GCC compile both preprocessed files in .9s with -O2.
>> - sparse check the O0 file in 1.93s and O2 file in 13s.
>> Thus even on the O0 file, the time is already too high because generaly
>> sparse is roughly 10 times faster than gcc -O2, here is twice as slow.
>> ...
>> 4) if we replace 'inline' by 'inline __attribute__((always_inline))'
>>    GCC needs roughly 58s to compile the O0 or O2 file.
>
> With the patch I sent, sparse now need 2.1s to compile the O2 file.

Wow, that is great. You are the man.

I will include that patch, I am going to take a closer look now.
Too tired last night.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] fix: give a type to bad conditionnal expressions
  2017-08-04 12:31                                         ` Luc Van Oostenryck
@ 2017-08-04 14:52                                           ` Christopher Li
  2017-08-04 14:53                                           ` Christopher Li
  1 sibling, 0 replies; 33+ messages in thread
From: Christopher Li @ 2017-08-04 14:52 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse

On Fri, Aug 4, 2017 at 8:31 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> Please ignore this patch as I must have forgot to recompile the last change.
> The problem is there and the solution too but bad_ctype won't solve
> the issue here.
>
> I'll send another version which will effectively solve the issue but
> I'm far from sure
> we'll want it.
>
> -- Luc
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] fix: give a type to bad conditionnal expressions
  2017-08-04 12:31                                         ` Luc Van Oostenryck
  2017-08-04 14:52                                           ` Christopher Li
@ 2017-08-04 14:53                                           ` Christopher Li
  1 sibling, 0 replies; 33+ messages in thread
From: Christopher Li @ 2017-08-04 14:53 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linux-Sparse

On Fri, Aug 4, 2017 at 8:31 AM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> Please ignore this patch as I must have forgot to recompile the last change.
> The problem is there and the solution too but bad_ctype won't solve
> the issue here.
>
> I'll send another version which will effectively solve the issue but
> I'm far from sure
> we'll want it.

Sorry I hit the send before I type up some reply in my last email.

I will wait for your new patch then.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-04 14:51                                     ` Christopher Li
@ 2017-08-04 22:26                                       ` Luc Van Oostenryck
  2017-08-05  0:23                                         ` Christopher Li
  0 siblings, 1 reply; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-04 22:26 UTC (permalink / raw)
  To: Christopher Li, Linus Torvalds; +Cc: Linux-Sparse, Dibyendu Majumdar

On Fri, Aug 4, 2017 at 4:51 PM, Christopher Li <sparse@chrisli.org> wrote:
> On Fri, Aug 4, 2017 at 7:33 AM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>> On Thu, Aug 03, 2017 at 11:49:08PM +0200, Luc Van Oostenryck wrote:
>>> 1) some numbers:
>>> - GCC compile both preprocessed files in .9s with -O2.
>>> - sparse check the O0 file in 1.93s and O2 file in 13s.
>>> Thus even on the O0 file, the time is already too high because generaly
>>> sparse is roughly 10 times faster than gcc -O2, here is twice as slow.
>>> ...
>>> 4) if we replace 'inline' by 'inline __attribute__((always_inline))'
>>>    GCC needs roughly 58s to compile the O0 or O2 file.
>>
>> With the patch I sent, sparse now need 2.1s to compile the O2 file.

I have investigated a little more because even 2.1 or 1.93s seems
too much to me.

My conclusion is that the file is (really too) big (but see the end).
For example, there is:
- about 1 million calls to clean_up_one_instruction
- and 2.6 million calls to  insn_compare()
OTOH there is only 56000 calls to try_to_cse()
and these results in 82000 calls to bb_dominates()
and 29000 calls to cse_one_instruction().

All this indicate that the CSE is rather efficient:
only 56000 real CSE checks, each calling roughly 3/2
calls to bb_dominates() and 1/2 calls to cse_one_insn().

And in fact, most of these calls are not even really expensive.

The real offending, taking about 75% of CPU time, is bb_dominates()
which while only directly called 82000 is a recursive function which
internally is called more than 71 million of time!
In other words, the mean recursion depth of bb_dominates() is 860,
which means that there are chains of bb->parent as long as 860.

By restricting the bb_dominates() in CSE to a reasonable depth of 32,
the compile time is reduced to .8s without changing a single bit in the
resulting code.

This may be a change we may consider for the future.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-04 22:26                                       ` Luc Van Oostenryck
@ 2017-08-05  0:23                                         ` Christopher Li
  2017-08-05 10:05                                           ` Luc Van Oostenryck
  0 siblings, 1 reply; 33+ messages in thread
From: Christopher Li @ 2017-08-05  0:23 UTC (permalink / raw)
  To: Luc Van Oostenryck; +Cc: Linus Torvalds, Linux-Sparse, Dibyendu Majumdar

On Fri, Aug 4, 2017 at 6:26 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> I have investigated a little more because even 2.1 or 1.93s seems
> too much to me.

Thanks for looking into this.

>
> My conclusion is that the file is (really too) big (but see the end).
> For example, there is:
> - about 1 million calls to clean_up_one_instruction
> - and 2.6 million calls to  insn_compare()
> OTOH there is only 56000 calls to try_to_cse()
> and these results in 82000 calls to bb_dominates()
> and 29000 calls to cse_one_instruction().
>
> All this indicate that the CSE is rather efficient:
> only 56000 real CSE checks, each calling roughly 3/2
> calls to bb_dominates() and 1/2 calls to cse_one_insn().
>
> And in fact, most of these calls are not even really expensive.
>
> The real offending, taking about 75% of CPU time, is bb_dominates()
> which while only directly called 82000 is a recursive function which
> internally is called more than 71 million of time!
> In other words, the mean recursion depth of bb_dominates() is 860,
> which means that there are chains of bb->parent as long as 860.

Yes, that is the impression I get as well for, some where the finding
dominates is taking a lot of time. Also the result is not saved.

>
> By restricting the bb_dominates() in CSE to a reasonable depth of 32,
> the compile time is reduced to .8s without changing a single bit in the
> resulting code.

We can revisit this after the release. I have side project work on finding
the dominatior tree.

Chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: ptrlist-iterator performance on one wine source file
  2017-08-05  0:23                                         ` Christopher Li
@ 2017-08-05 10:05                                           ` Luc Van Oostenryck
  0 siblings, 0 replies; 33+ messages in thread
From: Luc Van Oostenryck @ 2017-08-05 10:05 UTC (permalink / raw)
  To: Christopher Li; +Cc: Linus Torvalds, Linux-Sparse, Dibyendu Majumdar

On Sat, Aug 5, 2017 at 2:23 AM, Christopher Li <sparse@chrisli.org> wrote:
> On Fri, Aug 4, 2017 at 6:26 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>>
>> I have investigated a little more because even 2.1 or 1.93s seems
>> too much to me.
>
> Thanks for looking into this.
>
>> The real offending, taking about 75% of CPU time, is bb_dominates()
>> which while only directly called 82000 is a recursive function which
>> internally is called more than 71 million of time!
>> In other words, the mean recursion depth of bb_dominates() is 860,
>> which means that there are chains of bb->parent as long as 860.
>
> Yes, that is the impression I get as well for, some where the finding
> dominates is taking a lot of time. Also the result is not saved.
>
>>
>> By restricting the bb_dominates() in CSE to a reasonable depth of 32,
>> the compile time is reduced to .8s without changing a single bit in the
>> resulting code.
>
> We can revisit this after the release. I have side project work on finding
> the dominatior tree.

Don't forget that asking if this node dominates this other node is not
the same question as asking for the immediate dominator of a node.
In other words, even with the dominator tree you will still need to do
lookups in the tree.

-- Luc

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-08-05 10:05 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-27 15:05 ptrlist-iterator performance on one wine source file Christopher Li
2017-07-29 13:01 ` Luc Van Oostenryck
2017-07-29 15:53   ` Christopher Li
2017-07-29 16:04     ` Luc Van Oostenryck
2017-07-29 16:25       ` Christopher Li
2017-07-29 16:30         ` Christopher Li
2017-07-29 16:35         ` Luc Van Oostenryck
2017-07-29 19:33           ` Christopher Li
2017-07-29 21:47             ` Luc Van Oostenryck
2017-07-30  4:15               ` Christopher Li
2017-07-30 15:12                 ` Luc Van Oostenryck
2017-07-30 15:49                   ` Christopher Li
2017-07-30 16:16                     ` Luc Van Oostenryck
2017-08-01 20:33                       ` Luc Van Oostenryck
2017-08-01 21:09                         ` Christopher Li
2017-08-01 21:46                           ` Luc Van Oostenryck
2017-08-01 23:37                             ` Christopher Li
2017-08-02  0:42                               ` Christopher Li
     [not found]                             ` <CANeU7QmzundH7qpdYhQqDJgBv+5pPemwft+1uH5oVQ1POnoQDw@mail.gmail.com>
2017-08-02 22:50                               ` Luc Van Oostenryck
2017-08-03 21:49                                 ` Luc Van Oostenryck
2017-08-03 22:29                                   ` Luc Van Oostenryck
2017-08-03 22:35                                   ` Linus Torvalds
2017-08-04  0:04                                     ` Christopher Li
2017-08-04  0:11                                     ` Luc Van Oostenryck
2017-08-04  0:16                                       ` [PATCH] fix: give a type to bad conditionnal expressions Luc Van Oostenryck
2017-08-04 12:31                                         ` Luc Van Oostenryck
2017-08-04 14:52                                           ` Christopher Li
2017-08-04 14:53                                           ` Christopher Li
2017-08-04 11:33                                   ` ptrlist-iterator performance on one wine source file Luc Van Oostenryck
2017-08-04 14:51                                     ` Christopher Li
2017-08-04 22:26                                       ` Luc Van Oostenryck
2017-08-05  0:23                                         ` Christopher Li
2017-08-05 10:05                                           ` Luc Van Oostenryck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.