From: Jaskaran Singh <jaskaran.singh@collabora.com>
To: cocci@systeme.lip6.fr
Subject: [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing
Date: Thu, 10 Sep 2020 11:47:00 +0530 [thread overview]
Message-ID: <20200910061703.2397-1-jaskaran.singh@collabora.com> (raw)
This patch series aims to optimize performance for recursively parsing
header files in Coccinelle.
Coccinelle's C parsing subsystem has an option called --recursive-includes
to recursively parse header files. This is used for type
inference/annotation.
Previously, using --recursive-includes on the entire Linux kernel source
code would take far too long. On my computer with the following specs,
- Processor: AMD Ryzen 5 3550H
- RAM: 8 GB
it would take close to 7 hours to complete. The optimization that this
patch series implements reduces that time to 1 hour.
The following is a high-level description of what has been implemented:
- As header files are recursively parsed, they are scanned for the
following:
- fields of structs/unions/enums
- typedefs
- function prototypes
- global variables
The names of the above are stored in a "name cache", i.e. a hashtable to
map the name to the files it is declared in.
- A dependency graph is built to determine dependencies between all the
files in the codebase.
- In the type annotation phase of the C subsystem, if a function call,
struct/union field or identifier is encountered, the type of which is
not known to the annoter, the name cache is checked for the name.
- The name cache gives a list of files that the name is declared/defined
in. These files are cross checked with the dependency graph to
determine if any of these are reachable by the file that the annoter is
working on.
- If a reachable header file is found, that file is parsed and the type
associated to the name is returned.
Different approaches that were attempted to alleviate this issue, and the
problems with each are as follows:
- Caching the most recently used files: A LRU cache to store ASTs of the
most recently encountered header files. The problem with this approach
is the amount of memory it takes to cache the header file ASTs.
- Caching the most troublesome files: A pseudo-LFU cache to store files
that cumulatively take the longest to parse, and thus bloat the time
taken. The problem with this approach is the amount of memory it takes
to cache the header file ASTs.
- Skipping unparsable locations in header files: Skipping top-level items
in a header file that cannot be parsed. This approach does not produce
even close to the amount of optimization needed.
The next step from here would be:
- Maintain a small but persistent cache of header files in groups of
directories. Leverage multiprocessing for parsing these header files.
- Leverage multiprocessing to parse header files initially for name
extraction.
- Performing some initial matching with the semantic patch to determine if
a C file matches. If matches are found, call the annoter and recursively
parse header files for type annotation.
- Recursively parse all header files only once and build a large type
environment. Use the dependency graph to determine reachability. This
has potential memory usage issues though.
Changes in v2:
--------------
- Change occurences of 'begin' and 'match' on the same line with something else
to the next line for better readability.
Makefile | 2
parsing_c/includes_cache.ml | 286 +++++++++++++++++++++++++++++++++++++++++++
parsing_c/includes_cache.mli | 47 +++++++
parsing_c/parse_c.ml | 27 +++-
parsing_c/type_annoter_c.ml | 130 ++++++++++++++++---
5 files changed, 466 insertions(+), 26 deletions(-)
_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci
next reply other threads:[~2020-09-10 6:23 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-09-10 6:17 Jaskaran Singh [this message]
2020-09-10 6:17 ` [Cocci] [PATCH v2 1/3] parsing_c: includes_cache: Implement a name cache Jaskaran Singh
2020-09-10 6:17 ` [Cocci] [PATCH v2 2/3] parsing_c: parse_c: Build name cache and includes dependency graph Jaskaran Singh
2020-09-10 6:17 ` [Cocci] [PATCH v2 3/3] parsing_c: type_annoter_c: Use name cache for type annotation Jaskaran Singh
2020-09-10 6:19 ` [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
2020-09-15 19:23 ` Julia Lawall
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20200910061703.2397-1-jaskaran.singh@collabora.com \
--to=jaskaran.singh@collabora.com \
--cc=cocci@systeme.lip6.fr \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).