Re: dependency tee from c parser entities downto token

From: Konrad Eisele <eiselekd@gmail.com>
To: Christopher Li <sparse@chrisli.org>
Cc: Konrad Eisele <konrad@gaisler.com>, linux-sparse@vger.kernel.org
Subject: Re: dependency tee from c parser entities downto token
Date: Sat, 5 May 2012 18:59:39 +0200
Message-ID: <CAEjhO7JqWNwGzEYX+xr2gvuCTcnNPovNgk3X1q69E5_CMPnSTw@mail.gmail.com>
In-Reply-To: <CANeU7QnNemo=iyx3+VJBgEzf4kOacxOx9JTsbu5Of46+3NkAJA@mail.gmail.com>

>
> I am not sure I understand your range representation yet.
>

You need to view it with a fixed-width font. It's not rocket science:
token lists (or arrays) are shown as dot-separated lists. The token.pos
field is listed below each token, either as p[x] or, for tokens at
file scope, as the file location.

I'll come up with a patch that implements this scheme and send it when
I have time; it might take a while.
-- Konrad

> To be continued...
>
> Chris
>
>
>>
>> Note that a reference to p[] in p[x] notation only references
>> the "start" of the PP_struct.copy. A unique identification
>> of the "source" token might not always be possible because
>> of ambiguities, so when copying the tokens into
>> PP_struct.copy I might use an extended version of struct token
>> that also includes an offset.
>>
>> ----- file a.h start -----
>> #define D0(d0a0,d0a1) 1 D1(d0a0) 2 D2(d0a1) 3
>> #define D1(d1a0) 4 d1a0 5
>> #define D2(d2a0) 6 d2a0 7
>> #define D3(d3a0) 8 d3a0 9
>> D0(D3(10),11)
>> ----- file a.h end   -----
>>
>> Preprocessor output (gcc -E a.h): "1 4 8 10 9 5 2 6 11 7 3"
>>
>> PreProcessor macro trace on p[]:
>>
>> p[0]:mdefn_body[D0]     :1.D1.(.d0a0.).2.D2.(.d0a1.).3
>>                         [ a.h:1:23     ..   a.h:1:45]
>> p[1]:mdefn_body[D1]     :4   .   d1a0   .    5
>>                         [ a.h:2:18..a.h:2:25]
>> p[2]:mdefn_body[D2]     :6   .   d2a0   .    7
>>                         [ a.h:3:18..a.h:3:25]
>> p[3]:mdefn_body[D3]     :8   .   d3a0   .    9
>>                         [ a.h:4:18..a.h:4:25]
>> p[4]:minst_arg0[D0]     :D3  . (  .   10 . )
>>                         [ a.h:5:4..a.h:5:9]
>> p[5]:minst_arg1[D0]     :11
>>                         [a.h:5:11]
>> p[6]:minst_arg0[D3]     :10
>>                         p[4]
>> p[7]:(args)expand[p[3]] :8    .  10   .  9
>>                         p[3]    p[4]    p[3]
>> p[8]:minst_arg0[D2]     :11
>>                         p[5]
>> p[9]:(body)expand[p[2]] :6   .   11   .    7
>>                         p[2]    p[5]      p[2]
>> p[10]:(body)expand[p[0]]:1  .4  .8  .10 .9  .5  .2  .6  .11 .7  .3
>>                         p[0]p[1]p[7]p[7]p[7]p[1]p[0]p[9]p[9]p[9]p[0]
>>
>>
>> p[0]-p[3] are built when the macros are defined.
>>          A p[] entry is needed to distinguish between
>>          the different sources of tokens.
>> p[4],p[5] are built in collect_arguments() for D0(D3(10),11)
>> p[6]      is built in collect_arguments() for D3(10)
>> p[7]      is built in the call to the macro_expand() hook, with a
>>          flag that it is an (args)expand
>> p[8]      is built in collect_arguments() for D2(11)
>>          (inside D0's expansion)
>> p[9]      is built in the call to the macro_expand() hook, with a
>>          flag that it is a (body)expand (of D2)
>> p[10]     is built in the call to the macro_expand() hook, with a
>>          flag that it is a (body)expand (of D0)
>>
>> struct PP_struct {
>>          enum {minst_arg, expand_body, expand_arg, mdef_body} typ;
>>          unsigned int argidx;
>>          struct symbol *macro;
>>          struct token copy[];
>> };
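
To make the p[x] notation a bit more concrete, here is a minimal sketch of
how a copied token could be followed back to its file position, assuming
each copy carries a (source entry, offset) pair as mentioned above. All
type and function names (pp_tok, pp_rec, pp_origin) are made up for this
illustration and are not part of sparse; the sample data is p[3], p[4] and
p[7] from the trace.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; none of this is sparse API. */
enum pp_kind { MDEF_BODY, MINST_ARG, EXPAND_ARG, EXPAND_BODY };

struct pp_rec;

/* One copied token. Either it came straight from a file (src == NULL and
 * line/col are valid), or it is a copy of token number `off` in `src`. */
struct pp_tok {
    const char *text;
    const struct pp_rec *src;
    unsigned int off;
    int line, col;
};

/* One p[] entry: its kind plus the token copies it holds. */
struct pp_rec {
    enum pp_kind kind;
    size_t ntok;
    const struct pp_tok *copy;
};

/* Follow (src, off) links until a token with a file position is reached. */
static const struct pp_tok *pp_origin(const struct pp_tok *t)
{
    while (t->src) {
        assert(t->off < t->src->ntok);
        t = &t->src->copy[t->off];
    }
    return t;
}

/* p[3]: body of D3, "8 d3a0 9", a.h:4:18..a.h:4:25 */
static const struct pp_tok p3_toks[] = {
    { "8",    NULL, 0, 4, 18 },
    { "d3a0", NULL, 0, 4, 20 },
    { "9",    NULL, 0, 4, 25 },
};
static const struct pp_rec p3 = { MDEF_BODY, 3, p3_toks };

/* p[4]: argument 0 of D0, "D3 ( 10 )", a.h:5:4..a.h:5:9 */
static const struct pp_tok p4_toks[] = {
    { "D3", NULL, 0, 5, 4 },
    { "(",  NULL, 0, 5, 6 },
    { "10", NULL, 0, 5, 7 },
    { ")",  NULL, 0, 5, 9 },
};
static const struct pp_rec p4 = { MINST_ARG, 4, p4_toks };

/* p[7]: (args)expand of p[3], "8 10 9", tagged p[3] p[4] p[3] */
static const struct pp_tok p7_toks[] = {
    { "8",  &p3, 0, 0, 0 },
    { "10", &p4, 2, 0, 0 },
    { "9",  &p3, 2, 0, 0 },
};
```

With this data, pp_origin(&p7_toks[1]) ends up at the "10" token in p[4],
i.e. on line 5 of a.h, while the "8" and "9" resolve to the body of D3 on
line 4. The offset is what removes the ambiguity when the same text occurs
more than once in a source entry.
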
>>
>> Conclusion:
>> -----------
>> Apart from the macro_expand() hook I also need hooks
>> in macro definition and also in collect_arguments() or expand().
>>
>>
>> Concerning (3) How to connect (1) and (2) to the AST
>> ----------------------------------------------------
>>
>> This can maybe wait for a later iteration. There are more complex
>> parts involved...
>>
>>
>>
>>>
>>> Now, how to connect the AST with that information is a
>>> very good question. Notice the symbol->aux pointer? That is
>>> the place to attach extra context or back-end-related data
>>> to symbols.
>>>
>>> Each symbol has "pos" and "endpos". If the symbol
>>> is expanded from a macro then, using the previous scheme, the pos
>>> should point to a line in the "<pre-processor>" stream.
>>>
>>> However, if the macro expansion happens between "pos" and
>>> "endpos", you will not be able to easily access the token that
>>> contains the macro expansion's "pos".
>>>
>>> For that we could, just thinking out loud, add a parser
>>> hook for declarations that fires when a symbol is completely built.
>>> That would be a very small and straightforward change.
>>> If the hook is not NULL, the callback function will be called
>>> with the symbol that just got defined, and the start and end
>>> tokens of that symbol.
>>>
>>> So your dependency program just needs to register the
>>> symbol parsing hook. Inside the callback function, walk
>>> the tokens from start to end. Look up macro expansion information
>>> as needed. Build up the dependency struct and store it in
>>> symbol->aux.
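
If I read the proposal right, it would look roughly like the sketch below.
This is only my interpretation: the structs are simplified stand-ins, and
symbol_hook, symbol_done(), dep_hook() and the from_macro flag are names I
invented for the example; only the aux pointer is taken from your
description.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for illustration; not the real sparse structs. */
struct token {
    struct token *next;
    int from_macro;      /* hypothetical flag: token came from an expansion */
};

struct symbol {
    void *aux;           /* slot for data private to the registered tool */
};

/* Hypothetical hook type: called once a symbol has been fully parsed. */
typedef void (*symbol_hook_fn)(struct symbol *sym,
                               struct token *start, struct token *end);

static symbol_hook_fn symbol_hook;   /* NULL means nobody registered */

/* What the parser side would do when a declaration is complete. */
static void symbol_done(struct symbol *sym,
                        struct token *start, struct token *end)
{
    if (symbol_hook)
        symbol_hook(sym, start, end);
}

/* Everything below is private to the dependency tool. */
struct dep_info {
    unsigned int ntokens;        /* tokens between start and end, inclusive */
    unsigned int nmacro_tokens;  /* how many of them came from a macro */
};

static void dep_hook(struct symbol *sym,
                     struct token *start, struct token *end)
{
    struct dep_info *dep;
    struct token *t;

    assert(sym != NULL);
    dep = calloc(1, sizeof(*dep));
    if (!dep)
        return;

    /* Walk the token list from start to end, both inclusive. */
    for (t = start; t; t = t->next) {
        dep->ntokens++;
        if (t->from_macro)
            dep->nmacro_tokens++;
        if (t == end)
            break;
    }
    sym->aux = dep;
}
```

A tool that never sets symbol_hook pays nothing beyond the NULL check,
which I assume is the point of doing it this way.
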
>>>
>>> BTW, unrelated to this patch, I can see that other programs might
>>> be able to use the same parser hook to perform source code
>>> transformations as well.
>>>
>>> Makes sense? This way, you don't even need the hash
>>> table to attach a context to the token. You can get it directly
>>> from symbol->aux.
>>>
>>>> In my patch I have modeled (2) using 2 structs:
>>>> struct macro_expansion {
>>>>        int nargs;
>>>>        struct symbol *sym;
>>>>        struct token *m;
>>>>        struct arg args[0];
>>>> };
>>>> struct tok_macro_dep {
>>>>        struct macro_expansion *m;
>>>>        unsigned int argi;
>>>>        unsigned int isbody : 1;
>>>>        unsigned int visited : 1;
>>>> };
>>>> Each token from a macro expansion gets tagged with
>>>> tok_macro_dep. If it is a macro argument, <argi> shows the
>>>> index; if it is from the macro body, <isbody> is 1.
>>>> Now, I haven't yet thought about special cases like
>>>> token concatenation; even more data is needed to
>>>> model this. Also, when a macro argument is used again as a
>>>> macro argument inside the body expansion, I kind of
>>>> lose the chain: I would also need a "token *dup_of" pointer
>>>> to point to the original token that the token is a copy
>>>> of (when arguments are created...) etc.
>>>>
>>>> I have read your macro_expand() hook idea; however,
>>>> if I understand it right you want to reuse position.stream and
>>>> position.line as a kind of pointer (to save the extra 4 bytes).
>>>> (Your goal is to minimize codebase changes, but I wonder
>>>> whether you don't change the semantics of struct position and then
>>>> need to change the code that uses struct position anyway...)
>>>
>>>
>>> Nope, because the position.stream change only happens in
>>> your dependency analysis program. It is the dependency program
>>> that registers the hook. This behaviour is private to the dependency
>>> analysis program. Other programs that use the sparse library don't see
>>> it at all, because they don't register macro_expand hooks to perform
>>> those stream manipulations. They will receive the exact same AST as before.
>>>
>>>> Maybe it is possible like this... I doubt it. Where should
>>>> all the extra context that each token has be saved and
>>>> extracted from, using that scheme?
>>>
>>>
>>> Two places. One is symbol->aux. Also, the macro_expand context
>>> can be looked up by pos->line, which will index into the macro_expand
>>> array that stores the context.
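
So the lookup would be something like the following, as far as I
understand it. Again a sketch under my own assumptions: PP_STREAM,
expand_ctx, expand_table and expand_lookup() are invented names, and the
struct position here is reduced to the two fields under discussion.

```c
#include <assert.h>
#include <stddef.h>

#define PP_STREAM 1u   /* hypothetical id of the "<pre-processor>" stream */

/* Simplified stand-in; the real struct position packs more fields. */
struct position {
    unsigned int stream;
    unsigned int line;
};

/* Hypothetical per-expansion context kept by the dependency tool. */
struct expand_ctx {
    const char *macro_name;
};

struct expand_table {
    const struct expand_ctx *ctx;
    size_t count;
};

/* For a token in the pre-processor stream, pos.line is reused as an index
 * into the expansion table. Anything else has no expansion context. */
static const struct expand_ctx *
expand_lookup(const struct expand_table *tab, struct position pos)
{
    assert(tab != NULL);
    if (pos.stream != PP_STREAM || pos.line >= tab->count)
        return NULL;
    return &tab->ctx[pos.line];
}
```

Every consumer of pos->line then has to check the stream first, which is
the part I worry about.
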
>>>
>>> Having these two should be enough to produce the exact same
>>> dependency result as you are getting right now.
>>>
>>>> Maybe it is possible, but I don't want to make saving 4 bytes
>>>> a design goal (I'd use the void *custom scheme to
>>>> save all my extra data, including the pointers to tokens that
>>>> "sit around") and adjust everything else to
>>>> that. The consequence is that the code complexity would
>>>> grow on the other end.
>>>
>>>
>>> It is not only about saving 4 bytes. It is about other programs
>>> not having to pull in the full token struct if they don't need it.
>>> It is about reusable macro hooks and parser hooks, so that
>>> external programs can do fancier things like source code transformations
>>> without impacting the other users of the sparse lib.
>>>
>>>> Here is my compromise then:
>>>> Keep the original "pos". But still grant me, for
>>>> each struct, a "void *custom" pointer that I can use
>>>> to store extra data, i.e. a pointer to a token.
>>>
>>>
>>> symbol->aux.
>>>
>>> Chris
>>>
>>