Re: [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing

From: Julia Lawall <julia.lawall@inria.fr>
To: Jaskaran Singh <jaskaran.singh@collabora.com>
Cc: cocci@systeme.lip6.fr
Subject: Re: [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing
Date: Tue, 15 Sep 2020 21:23:33 +0200 (CEST)	[thread overview]
Message-ID: <alpine.DEB.2.22.394.2009152123150.6185@hadrien> (raw)
In-Reply-To: <20200910061703.2397-1-jaskaran.singh@collabora.com>

On Thu, 10 Sep 2020, Jaskaran Singh wrote:

> This patch series aims to optimize performance for recursively parsing
> header files in Coccinelle.
>
> Coccinelle's C parsing subsystem has an option called --recursive-includes
> to recursively parse header files. This is used for type
> inference/annotation.
>
> Previously, using --recursive-includes on the entire Linux kernel source
> code would take far too long. On my computer with the following specs,
> 	- Processor: AMD Ryzen 5 3550H
> 	- RAM: 8 GB
> it would take close to 7 hours to complete.  The optimization that this
> patch series implements reduces that time to 1 hour.
>
> The following is a high-level description of what has been implemented:
> - As header files are recursively parsed, they are scanned for the
>   following:
> 	- fields of structs/unions/enums
> 	- typedefs
> 	- function prototypes
> 	- global variables
>   The names of the above are stored in a "name cache", i.e. a hashtable to
>   map the name to the files it is declared in.
> - A dependency graph is built to determine dependencies between all the
>   files in the codebase.
> - In the type annotation phase of the C subsystem, if a function call,
>   struct/union field or identifier is encountered, the type of which is
>   not known to the annoter, the name cache is checked for the name.
> - The name cache gives a list of files that the name is declared/defined
>   in.  These files are cross checked with the dependency graph to
>   determine if any of these are reachable by the file that the annoter is
>   working on.
> - If a reachable header file is found, that file is parsed and the type
>   associated to the name is returned.
>
> Different approaches that were attempted to alleviate this issue, and the
> problems with each are as follows:
> - Caching the most recently used files: A LRU cache to store ASTs of the
>   most recently encountered header files. The problem with this approach
>   is the amount of memory it takes to cache the header file ASTs.
> - Caching the most troublesome files: A pseudo-LFU cache to store files
>   that cumulatively take the longest to parse, and thus bloat the time
>   taken. The problem with this approach is the amount of memory it takes
>   to cache the header file ASTs.
> - Skipping unparsable locations in header files: Skipping top-level items
>   in a header file that cannot be parsed. This approach does not produce
>   even close to the amount of optimization needed.
>
> The next step from here would be:
> - Maintain a small but persistent cache of header files in groups of
>   directories. Leverage multiprocessing for parsing these header files.
> - Leverage multiprocessing to parse header files initially for name
>   extraction.
> - Performing some initial matching with the semantic patch to determine if
>   a C file matches. If matches are found, call the annoter and recursively
>   parse header files for type annotation.
> - Recursively parse all header files only once and build a large type
>   environment. Use the dependency graph to determine reachability. This
>   has potential memory usage issues though.

Applied.

>
>
> Changes in v2:
> --------------
> - Change occurences of 'begin' and 'match' on the same line with something else
>   to the next line for better readability.
>
>
>  Makefile                     |    2
>  parsing_c/includes_cache.ml  |  286 +++++++++++++++++++++++++++++++++++++++++++
>  parsing_c/includes_cache.mli |   47 +++++++
>  parsing_c/parse_c.ml         |   27 +++-
>  parsing_c/type_annoter_c.ml  |  130 ++++++++++++++++---
>  5 files changed, 466 insertions(+), 26 deletions(-)
>
>
>
_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci