cocci.inria.fr archive mirror
 help / color / mirror / Atom feed
* [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing
@ 2020-09-10  6:17 Jaskaran Singh
  2020-09-10  6:17 ` [Cocci] [PATCH v2 1/3] parsing_c: includes_cache: Implement a name cache Jaskaran Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Jaskaran Singh @ 2020-09-10  6:17 UTC (permalink / raw)
  To: cocci

This patch series aims to optimize performance for recursively parsing
header files in Coccinelle.

Coccinelle's C parsing subsystem has an option called --recursive-includes
to recursively parse header files. This is used for type
inference/annotation.

Previously, using --recursive-includes on the entire Linux kernel source
code would take far too long. On my computer with the following specs,
	- Processor: AMD Ryzen 5 3550H
	- RAM: 8 GB
it would take close to 7 hours to complete.  The optimization that this
patch series implements reduces that time to 1 hour.

The following is a high-level description of what has been implemented:
- As header files are recursively parsed, they are scanned for the
  following:
	- fields of structs/unions/enums
	- typedefs
	- function prototypes
	- global variables
  The names of the above are stored in a "name cache", i.e. a hashtable to
  map the name to the files it is declared in.
- A dependency graph is built to determine dependencies between all the
  files in the codebase.
- In the type annotation phase of the C subsystem, if a function call,
  struct/union field or identifier is encountered, the type of which is
  not known to the annoter, the name cache is checked for the name.
- The name cache gives a list of files that the name is declared/defined
  in.  These files are cross checked with the dependency graph to
  determine if any of these are reachable by the file that the annoter is
  working on.
- If a reachable header file is found, that file is parsed and the type
  associated to the name is returned.

Different approaches that were attempted to alleviate this issue, and the
problems with each are as follows:
- Caching the most recently used files: A LRU cache to store ASTs of the
  most recently encountered header files. The problem with this approach
  is the amount of memory it takes to cache the header file ASTs.
- Caching the most troublesome files: A pseudo-LFU cache to store files
  that cumulatively take the longest to parse, and thus bloat the time
  taken. The problem with this approach is the amount of memory it takes
  to cache the header file ASTs.
- Skipping unparsable locations in header files: Skipping top-level items
  in a header file that cannot be parsed. This approach does not produce
  even close to the amount of optimization needed.

The next step from here would be:
- Maintain a small but persistent cache of header files in groups of
  directories. Leverage multiprocessing for parsing these header files.
- Leverage multiprocessing to parse header files initially for name
  extraction.
- Performing some initial matching with the semantic patch to determine if
  a C file matches. If matches are found, call the annoter and recursively
  parse header files for type annotation.
- Recursively parse all header files only once and build a large type
  environment. Use the dependency graph to determine reachability. This
  has potential memory usage issues though.


Changes in v2:
--------------
- Change occurences of 'begin' and 'match' on the same line with something else
  to the next line for better readability.


 Makefile                     |    2 
 parsing_c/includes_cache.ml  |  286 +++++++++++++++++++++++++++++++++++++++++++
 parsing_c/includes_cache.mli |   47 +++++++
 parsing_c/parse_c.ml         |   27 +++-
 parsing_c/type_annoter_c.ml  |  130 ++++++++++++++++---
 5 files changed, 466 insertions(+), 26 deletions(-)


_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Cocci] [PATCH v2 1/3] parsing_c: includes_cache: Implement a name cache
  2020-09-10  6:17 [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
@ 2020-09-10  6:17 ` Jaskaran Singh
  2020-09-10  6:17 ` [Cocci] [PATCH v2 2/3] parsing_c: parse_c: Build name cache and includes dependency graph Jaskaran Singh
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Jaskaran Singh @ 2020-09-10  6:17 UTC (permalink / raw)
  To: cocci

Implement a name cache and includes dependency graph to optimize
performance for recursive parsing of header files.

The following is a high-level description of what has been implemented:
- As header files are recursively parsed, they are scanned for the
  following:
	- fields of structs/unions/enums
	- typedefs
	- function prototypes
	- global variables
  The names of the above are stored in a "name cache", i.e. a hashtable
  to map the name to the files it is declared in.
- A dependency graph is built to determine dependencies between all the
  files in the codebase.
- In the type annotation phase of the C subsystem, if a function call,
  struct/union field or identifier is encountered, the type of which is
  not known to the annoter, the name cache is checked for the name.
- The name cache gives a list of files that the name is declared/defined
  in.  These files are cross checked with the dependency graph to
  determine if any of these are reachable by the file that the annoter is
  working on.
- If a reachable header file is found, that file is parsed and all of
  the above listed constructs are extracted from it.

Suggested-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Jaskaran Singh <jaskaran.singh@collabora.com>
---
 Makefile                     |   2 +-
 parsing_c/includes_cache.ml  | 290 +++++++++++++++++++++++++++++++++++
 parsing_c/includes_cache.mli |  47 ++++++
 3 files changed, 338 insertions(+), 1 deletion(-)
 create mode 100644 parsing_c/includes_cache.ml
 create mode 100644 parsing_c/includes_cache.mli

diff --git a/Makefile b/Makefile
index e25174413..f8d3424c0 100644
--- a/Makefile
+++ b/Makefile
@@ -50,7 +50,7 @@ SOURCES_parsing_cocci := \
 SOURCES_parsing_c := \
 	token_annot.ml flag_parsing_c.ml parsing_stat.ml \
 	token_c.ml ast_c.ml includes.ml control_flow_c.ml \
-	visitor_c.ml lib_parsing_c.ml control_flow_c_build.ml \
+	visitor_c.ml lib_parsing_c.ml includes_cache.ml control_flow_c_build.ml \
 	pretty_print_c.ml semantic_c.ml lexer_parser.ml parser_c.mly \
 	lexer_c.mll parse_string_c.ml token_helpers.ml token_views_c.ml \
 	cpp_token_c.ml parsing_hacks.ml cpp_analysis_c.ml \
diff --git a/parsing_c/includes_cache.ml b/parsing_c/includes_cache.ml
new file mode 100644
index 000000000..ca5c91822
--- /dev/null
+++ b/parsing_c/includes_cache.ml
@@ -0,0 +1,290 @@
+(*
+ * This file is part of Coccinelle, licensed under the terms of the GPL v2.
+ * See copyright.txt in the Coccinelle source code for more information.
+ * The Coccinelle source code can be obtained at http://coccinelle.lip6.fr
+ *)
+
+open Common
+module Lib = Lib_parsing_c
+
+(*****************************************************************************)
+(* Wrappers *)
+(*****************************************************************************)
+let pr_inc s =
+  if !Flag_parsing_c.verbose_includes
+  then Common.pr2 s
+
+(*****************************************************************************)
+(* Graph types/modules *)
+(*****************************************************************************)
+
+(* Filenames as keys to check paths from file A to file B. *)
+module Key : Set.OrderedType with type t = Common.filename = struct
+  type t = Common.filename
+  let compare = String.compare
+end
+
+module KeySet = Set.Make (Key)
+
+module KeyMap = Map.Make (Key)
+
+module Node : Set.OrderedType with type t = unit = struct
+  type t = unit
+  let compare = compare
+end
+
+module Edge : Set.OrderedType with type t = unit = struct
+  type t = unit
+  let compare = compare
+end
+
+module KeyEdgePair : Set.OrderedType with type t = Key.t * Edge.t =
+struct
+  type t = Key.t * Edge.t
+  let compare = compare
+end
+
+module KeyEdgeSet = Set.Make (KeyEdgePair)
+
+module G = Ograph_simple.Make
+  (Key) (KeySet) (KeyMap) (Node) (Edge) (KeyEdgePair) (KeyEdgeSet)
+
+
+(*****************************************************************************)
+(* Includes dependency graph *)
+(*****************************************************************************)
+
+(* Header file includes dependency graph *)
+let dependency_graph = ref (new G.ograph_mutable)
+
+(* Check if a path exists between one node to another.
+ * Almost a copy of dfs_iter in commons/ograph_extended.ml with minor changes.
+ * Return true if g satisfies predicate f else false
+ *)
+let dfs_exists xi f g =
+  let already = Hashtbl.create 101 in
+  let rec aux_dfs xs =
+    let h xi =
+      if Hashtbl.mem already xi
+      then false
+      else
+        begin
+          Hashtbl.add already xi true;
+          if f xi
+          then true
+          else
+            begin
+              let f' (key, _) keyset = KeySet.add key keyset in
+              let newset =
+                try KeyEdgeSet.fold f' (g#successors xi) KeySet.empty
+                with Not_found -> KeySet.empty in
+              aux_dfs newset
+            end
+        end in
+    KeySet.exists h xs in
+  aux_dfs (KeySet.singleton xi)
+
+let add_to_dependency_graph parent file =
+  let add_node a =
+    if not (KeyMap.mem a !dependency_graph#nodes)
+    then !dependency_graph#add_node a () in
+  let add_arc (a, b) =
+    add_node a;
+    add_node b;
+    !dependency_graph#add_arc (a, b) () in
+  add_arc (parent, file)
+
+let path_exists_dependency_graph filea fileb =
+  dfs_exists filea (fun a -> a = fileb) !dependency_graph
+
+let print_dependency_graph _ =
+  G.print_ograph_generic
+    ~str_of_key:(fun k -> "\"" ^ k ^ "\"")
+    ~str_of_node:(fun k _ -> k)
+    "/tmp/headers.dot"
+    !dependency_graph
+
+
+(*****************************************************************************)
+(* Name cache *)
+(*****************************************************************************)
+
+type cache_exp =
+  | CacheField of string (* Name of the struct/union it is defined in *)
+  | CacheEnumConst
+  | CacheVarFunc
+  | CacheTypedef
+
+(* Not very elegant. Basically a copy-paste of namedef in type_annoter_c. *)
+type cache_return =
+  | RetVarOrFunc of string * Ast_c.exp_type
+  | RetEnumConstant of string * string option
+  | RetTypeDef   of string * Ast_c.fullType
+  | RetStructUnionNameDef of string * (Ast_c.structUnion * Ast_c.structType)
+                          Ast_c.wrap
+
+(* Map name to list of files it is defined in *)
+let name_cache : (string, (filename * cache_exp) list) Hashtbl.t ref =
+  ref (Hashtbl.create 101)
+
+let add_to_name_cache name (file, exp) =
+  let l =
+    try Hashtbl.find !name_cache name
+    with Not_found -> [] in
+  Hashtbl.replace !name_cache name ([(file, exp)] @ l)
+
+(* Visitor to cache names in a given file. *)
+let cache_name_visitor file =
+
+  let cache_struct_fields sname def =
+    let _cache_field field =
+      match (Ast_c.unwrap field) with
+        Ast_c.Simple (name, _)
+      | Ast_c.BitField (name, _, _, _) ->
+          name +>
+            do_option
+              (fun n ->
+                 add_to_name_cache (Ast_c.str_of_name n)
+                 (file, CacheField(sname))) in
+    let _cache_struct_fields field =
+      match field with
+        Ast_c.DeclarationField(Ast_c.FieldDeclList(l)) ->
+          List.iter _cache_field (Ast_c.unwrap l)
+      | _ -> () in
+    List.iter _cache_struct_fields def in
+
+  let cache_enum_constants def =
+    def +>
+      List.iter
+        (fun ec ->
+           add_to_name_cache
+             ((Ast_c.unwrap ec) +> fst +> Ast_c.str_of_name)
+             (file, CacheEnumConst)) in
+
+  { Visitor_c.default_visitor_c with
+    Visitor_c.ktoplevel = fun (k, bigf) p ->
+      match p with
+        Ast_c.Declaration
+          (Ast_c.DeclList (defs, _)) ->
+             let cache_name x =
+               (match (Ast_c.unwrap x) with
+                 {Ast_c.v_namei = Some (name, _);
+                  Ast_c.v_type = typ;
+                  Ast_c.v_storage = strg} ->
+                  (* Cache typedefs/variables/functions *)
+                    let exp_type =
+                      match (fst strg) with
+                        Ast_c.StoTypedef -> CacheTypedef
+                      | _ -> CacheVarFunc in
+                    add_to_name_cache
+                      (Ast_c.str_of_name name) (file, exp_type)
+               | {Ast_c.v_namei = None; Ast_c.v_type = typ} ->
+                    (match (Ast_c.unwrap (snd typ)) with
+                      Ast_c.StructUnion (_, Some n, def) ->
+                        (* Cache field names *)
+                        cache_struct_fields n def
+                    | Ast_c.Enum(_, def) ->
+                        (* Cache enumeration constants *)
+                        cache_enum_constants def
+                    | _ -> k p)) in
+             List.iter cache_name defs
+      | _ -> k p
+  }
+
+
+let get_type_visitor file l =
+  let add_to_ret ret =
+    l := [ret] @ !l in
+  let get_enum_constants sopt def =
+    def +>
+      List.iter
+        (fun ec ->
+           let s = (Ast_c.unwrap ec) +> fst +> Ast_c.str_of_name in
+           add_to_ret (RetEnumConstant(s, sopt))) in
+  { Visitor_c.default_visitor_c with
+    Visitor_c.ktoplevel = fun (k, bigf) p ->
+      match p with
+        Ast_c.Declaration
+          (Ast_c.DeclList (defs, _)) ->
+             let get_name x =
+               (match (Ast_c.unwrap x) with
+                 {Ast_c.v_namei = Some (n, _);
+                  Ast_c.v_type = typ;
+                  Ast_c.v_storage = strg} ->
+                  (* Cache typedefs/variables/functions *)
+                    let f _ =
+                      let s = Ast_c.str_of_name n in
+                      match (fst strg) with
+                        Ast_c.StoTypedef ->
+                          add_to_ret (RetTypeDef(s, typ))
+                      | Ast_c.NoSto
+                      | Ast_c.Sto _ ->
+                          add_to_ret
+                            (RetVarOrFunc(s, (typ, Ast_c.NotLocalVar))) in
+                    f ()
+               | {Ast_c.v_namei = None; Ast_c.v_type = typ} ->
+                   (match (Ast_c.unwrap (snd typ)) with
+                     Ast_c.StructUnion (su, snameopt, def) ->
+                       (match snameopt with
+                         Some s ->
+                           let def' = Lib.al_fields def in
+                           let ii = Ast_c.get_ii_typeC_take_care (snd typ) in
+                           let ii' = Lib.al_ii ii in
+                           add_to_ret
+                             (RetStructUnionNameDef (s, ((su, def'),ii')))
+                       | None -> k p)
+                   | Ast_c.Enum(sopt, def) ->
+                       get_enum_constants sopt def
+                   | _ -> k p)) in
+             List.iter get_name defs
+      | _ -> k p
+  }
+
+let extract_names file ast =
+  ignore (Visitor_c.vk_program (cache_name_visitor file) ast)
+
+(* Use the visitor to get all the types from a file. *)
+let get_types_from_name_cache file name exp_types parse_fn =
+  if !Includes.include_headers_for_types && Includes.is_header file
+  then []
+  else
+    let rec find_file l =
+      match l with
+        [] -> None
+      | (x, y)::xs ->
+          if List.mem y exp_types
+          then
+            begin
+              if path_exists_dependency_graph file x
+              then Some (x, y)
+              else find_file xs
+            end
+          else find_file xs in
+    let file_list =
+      try Hashtbl.find !name_cache name
+      with Not_found -> [] in
+    match (find_file file_list) with
+      None -> []
+    | Some (x, t) ->
+        let ast = parse_fn x in
+        let env = ref [] in
+        ignore
+          (Visitor_c.vk_program
+            (get_type_visitor file env) ast);
+        !env
+
+
+(*****************************************************************************)
+(* Set of parsed files *)
+(*****************************************************************************)
+
+module StringSet = Set.Make (String)
+
+(* Set of files parsed *)
+let seen_files = ref (StringSet.empty)
+
+let has_been_parsed file =
+  StringSet.mem file !seen_files
+
+let add_to_parsed_files file =
+  seen_files := StringSet.add file !seen_files
diff --git a/parsing_c/includes_cache.mli b/parsing_c/includes_cache.mli
new file mode 100644
index 000000000..2384cb2f7
--- /dev/null
+++ b/parsing_c/includes_cache.mli
@@ -0,0 +1,47 @@
+(*
+ * This file is part of Coccinelle, licensed under the terms of the GPL v2.
+ * See copyright.txt in the Coccinelle source code for more information.
+ * The Coccinelle source code can be obtained at http://coccinelle.lip6.fr
+ *)
+
+(*****************************************************************************)
+(* Includes dependency graph *)
+(*****************************************************************************)
+
+val add_to_dependency_graph : Common.filename -> Common.filename -> unit
+
+val print_dependency_graph : unit -> unit
+
+(*****************************************************************************)
+(* Name cache *)
+(*****************************************************************************)
+
+type cache_exp =
+  | CacheField of string (* Name of the struct/union it is defined in *)
+  | CacheEnumConst
+  | CacheVarFunc
+  | CacheTypedef
+
+type cache_return =
+  | RetVarOrFunc of string * Ast_c.exp_type
+  | RetEnumConstant of string * string option
+  | RetTypeDef   of string * Ast_c.fullType
+  | RetStructUnionNameDef of string * (Ast_c.structUnion * Ast_c.structType)
+                          Ast_c.wrap
+
+val extract_names : Common.filename -> Ast_c.program -> unit
+
+val get_types_from_name_cache :
+  Common.filename ->
+    string ->
+    cache_exp list ->
+    (Common.filename -> Ast_c.program) ->
+    cache_return list
+
+(*****************************************************************************)
+(* Set of parsed files *)
+(*****************************************************************************)
+
+val has_been_parsed : Common.filename -> bool
+
+val add_to_parsed_files : Common.filename -> unit
-- 
2.21.3

_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [Cocci] [PATCH v2 2/3] parsing_c: parse_c: Build name cache and includes dependency graph
  2020-09-10  6:17 [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
  2020-09-10  6:17 ` [Cocci] [PATCH v2 1/3] parsing_c: includes_cache: Implement a name cache Jaskaran Singh
@ 2020-09-10  6:17 ` Jaskaran Singh
  2020-09-10  6:17 ` [Cocci] [PATCH v2 3/3] parsing_c: type_annoter_c: Use name cache for type annotation Jaskaran Singh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Jaskaran Singh @ 2020-09-10  6:17 UTC (permalink / raw)
  To: cocci

Build the includes dependency graph and name cache while parsing header
files. Every header file is parsed only once for name caching and, while
parsing these files, an includes dependency graph is built to determine
reachability of one header file from another file.

Signed-off-by: Jaskaran Singh <jaskaran.singh@collabora.com>
---
 parsing_c/parse_c.ml | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/parsing_c/parse_c.ml b/parsing_c/parse_c.ml
index 5574cb11b..ef5870123 100644
--- a/parsing_c/parse_c.ml
+++ b/parsing_c/parse_c.ml
@@ -17,6 +17,7 @@ open Common
 
 module TH = Token_helpers
 module LP = Lexer_parser
+module IC = Includes_cache
 
 module Stat = Parsing_stat
 
@@ -995,15 +996,30 @@ let rec _parse_print_error_heuristic2 saved_typedefs saved_macros
 and handle_include file wrapped_incl k =
     let incl = Ast_c.unwrap wrapped_incl.Ast_c.i_include in
     let parsing_style = Includes.get_parsing_style () in
+    let f = Includes.resolve file parsing_style incl in
     if Includes.should_parse parsing_style file incl
     then
-      match Includes.resolve file parsing_style incl with
+      match f with
       | Some header_filename when Common.lfile_exists header_filename ->
-	  (if !Flag_parsing_c.verbose_includes
-	  then pr2 ("including "^header_filename));
-	  let nonlocal =
-	    match incl with Ast_c.NonLocal _ -> true | _ -> false in
-          ignore (k nonlocal header_filename)
+          if not (IC.has_been_parsed header_filename)
+          then
+            begin
+              IC.add_to_parsed_files header_filename;
+              (if !Flag_parsing_c.verbose_includes
+              then pr2 ("including "^header_filename));
+              let nonlocal =
+                match incl with Ast_c.NonLocal _ -> true | _ -> false in
+              let res = k nonlocal header_filename in
+              match res with
+                None -> ()
+              | Some x ->
+                  let pt = x.parse_trees in
+                  let (p, _, _) = pt in
+                  with_program2_unit
+                    (IC.extract_names header_filename)
+                    p
+            end;
+          IC.add_to_dependency_graph file header_filename;
       | _ -> ()
 
 and _parse_print_error_heuristic2bis saved_typedefs saved_macros
-- 
2.21.3

_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [Cocci] [PATCH v2 3/3] parsing_c: type_annoter_c: Use name cache for type annotation
  2020-09-10  6:17 [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
  2020-09-10  6:17 ` [Cocci] [PATCH v2 1/3] parsing_c: includes_cache: Implement a name cache Jaskaran Singh
  2020-09-10  6:17 ` [Cocci] [PATCH v2 2/3] parsing_c: parse_c: Build name cache and includes dependency graph Jaskaran Singh
@ 2020-09-10  6:17 ` Jaskaran Singh
  2020-09-10  6:19 ` [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
  2020-09-15 19:23 ` Julia Lawall
  4 siblings, 0 replies; 6+ messages in thread
From: Jaskaran Singh @ 2020-09-10  6:17 UTC (permalink / raw)
  To: cocci

Use the name cache for type annotation. On encountering the following
which are not stored in the environment, the name cache is looked up and
the relevant header file is parsed for type information:
	- struct field use
	- typedef
	- function call
	- identifier
	- enumeration constant

Signed-off-by: Jaskaran Singh <jaskaran.singh@collabora.com>
---
 parsing_c/type_annoter_c.ml | 134 +++++++++++++++++++++++++++++++-----
 1 file changed, 115 insertions(+), 19 deletions(-)

diff --git a/parsing_c/type_annoter_c.ml b/parsing_c/type_annoter_c.ml
index 25cb6c0ee..49fd060be 100644
--- a/parsing_c/type_annoter_c.ml
+++ b/parsing_c/type_annoter_c.ml
@@ -19,6 +19,7 @@ open Common
 open Ast_c
 
 module Lib = Lib_parsing_c
+module IC = Includes_cache
 
 (*****************************************************************************)
 (* Prelude *)
@@ -186,6 +187,13 @@ type nameenv = {
 
 type environment = nameenv list
 
+let includes_parse_fn file =
+  let choose_includes = Includes.get_parsing_style () in
+  Includes.set_parsing_style Includes.Parse_no_includes;
+  let ret = Parse_c.parse_c_and_cpp false false file in
+  Includes.set_parsing_style choose_includes;
+  List.map fst (fst ret)
+
 (* ------------------------------------------------------------ *)
 (* can be modified by the init_env function below, by
  * the file environment_unix.h
@@ -294,6 +302,39 @@ let member_env_lookup_enum s env =
   | [] -> false
   | env :: _ -> StringMap.mem s env.enum_constant
 
+(* ------------------------------------------------------------ *)
+
+let add_cache_binding_in_scope namedef =
+  let (current, older) = Common.uncons !_scoped_env in
+  let new_frame fr =
+    match namedef with
+      | IC.RetVarOrFunc (s, typ) ->
+	  {fr with
+	   var_or_func = StringMap.add s typ fr.var_or_func}
+      | IC.RetTypeDef   (s, typ) ->
+	  let cv = typ, fr.typedef, fr.level in
+	  let new_typedef_c : typedefs = { defs = StringMap.add s cv fr.typedef.defs } in
+	  {fr with typedef = new_typedef_c}
+      | IC.RetStructUnionNameDef (s, (su, typ)) ->
+	  {fr with
+	   struct_union_name_def = StringMap.add s (su, typ) fr.struct_union_name_def}
+      | IC.RetEnumConstant (s, body) ->
+	  {fr with
+	   enum_constant = StringMap.add s body fr.enum_constant} in
+  (* These are global, so have to reflect them in all the frames. *)
+  _scoped_env := (new_frame current)::(List.map new_frame older)
+
+(* Has side-effects on the environment.
+ * TODO: profile? *)
+let get_type_from_includes_cache file name exp_types on_success on_failure =
+  let file_bindings =
+    IC.get_types_from_name_cache
+      file name exp_types includes_parse_fn in
+  List.iter add_cache_binding_in_scope file_bindings;
+  match file_bindings with
+    [] -> on_failure ()
+  | _ -> on_success ()
+
 
 (*****************************************************************************)
 (* "type-lookup"  *)
@@ -394,7 +435,14 @@ let rec type_unfold_one_step ty env =
 	  then type_unfold_one_step t' env'
           else loop (s::seen) t' env
        with Not_found ->
-          ty
+          let f = Ast_c.file_of_info (Ast_c.info_of_name name) in
+          get_type_from_includes_cache
+            f s [IC.CacheTypedef]
+            (fun () ->
+              let (t', env') = lookup_typedef s !_scoped_env in
+              TypeName (name, Some t') +>
+              Ast_c.rewrap_typeC ty)
+            (fun () -> ty)
       )
 
   | FieldType (t, _, _) -> type_unfold_one_step t env
@@ -474,7 +522,15 @@ let rec typedef_fix ty env =
 	      TypeName (name, Some fixed) +>
 	      Ast_c.rewrap_typeC ty
             with Not_found ->
-              ty))
+              let f = Ast_c.file_of_info (Ast_c.info_of_name name) in
+              get_type_from_includes_cache
+                f s [IC.CacheTypedef]
+                (fun () ->
+                   let (t', env') = lookup_typedef s !_scoped_env in
+                   TypeName (name, Some t') +>
+                   Ast_c.rewrap_typeC ty)
+                (fun () -> ty)
+              ))
 
     | FieldType (t, a, b) ->
 	FieldType (typedef_fix t env, a, b) +> Ast_c.rewrap_typeC ty
@@ -797,8 +853,16 @@ let annotater_expr_visitor_subpart = (fun (k,bigf) expr ->
                     Type_c.noTypeHere
                 )
             | None ->
-                pr2_once ("type_annotater: no type for function ident: " ^ s);
-                Type_c.noTypeHere
+                let f =
+                  Ast_c.file_of_info
+                    (Ast_c.info_of_name ident) in
+                get_type_from_includes_cache
+                  f s [IC.CacheVarFunc]
+                  (fun () ->
+                     match lookup_opt_env lookup_var s with
+                       Some (typ, local) -> make_info_fix (typ, local)
+                     | None -> Type_c.noTypeHere)
+                  (fun () -> Type_c.noTypeHere)
             )
         )
 
@@ -848,22 +912,36 @@ let annotater_expr_visitor_subpart = (fun (k,bigf) expr ->
                 | Some _ ->
                     make_info_def (type_of_s "int")
                 | None ->
-                    if not (s =~ "[A-Z_]+") (* if macro then no warning *)
-                    then
-                      if !Flag_parsing_c.check_annotater then
-                        if not (Hashtbl.mem _notyped_var s)
-                        then begin
-                          pr2 ("Type_annoter: no type found for: " ^ s);
-                          Hashtbl.add _notyped_var s true;
-                        end
-                        else ()
-                      else
-                        pr2 ("Type_annoter: no type found for: " ^ s)
-                    ;
-                    Type_c.noTypeHere
-                )
+                    let f = Ast_c.file_of_info (Ast_c.info_of_name ident) in
+                    let failure_fn =
+                      (fun () ->
+                         if not (s =~ "[A-Z_]+") (* if macro then no warning *)
+                         then
+                           if !Flag_parsing_c.check_annotater
+                           then
+                             if not (Hashtbl.mem _notyped_var s)
+                             then
+                               begin
+                                 pr2
+                                   ("Type_annoter: no type found for: " ^ s);
+                                 Hashtbl.add _notyped_var s true;
+                               end
+                             else ()
+                           else pr2 ("Type_annoter: no type found for: " ^ s);
+                         Type_c.noTypeHere) in
+                    get_type_from_includes_cache
+                      f s [IC.CacheEnumConst;IC.CacheVarFunc]
+                      (fun () ->
+                         match lookup_opt_env lookup_enum s with
+                           Some _ -> make_info_def (type_of_s "int")
+                         | None ->
+                             (match lookup_opt_env lookup_var s with
+                                Some (typ,local) -> make_info_fix (typ,local)
+                              | None -> failure_fn ()))
+                      failure_fn
             )
         )
+    )
 
     (* -------------------------------------------------- *)
     (* C isomorphism on type on array and pointers *)
@@ -954,7 +1032,25 @@ let annotater_expr_visitor_subpart = (fun (k,bigf) expr ->
                         Type_c.noTypeHere
                   )
               | _ ->
-		  Type_c.noTypeHere
+                let s = Ast_c.str_of_name namefld in
+                let f = Ast_c.file_of_info (Ast_c.info_of_name namefld) in
+                let ret_typ =
+                  (match (Ast_c.unwrap (snd t)) with
+                    Ast_c.StructUnionName(su, sname) ->
+                      get_type_from_includes_cache
+                        f s [IC.CacheField sname]
+                        (fun () ->
+                           try
+                             let ((su,fields),ii) =
+                               lookup_structunion (su, sname) !_scoped_env in
+                               try
+                                  make_info_def_fix
+                                    (Type_c.type_field fld (su, fields))
+                                with _ -> Type_c.noTypeHere
+                           with Not_found -> Type_c.noTypeHere)
+                        (fun () -> Type_c.noTypeHere)
+                  | _ -> Type_c.noTypeHere)
+                in ret_typ
           )
         )
 
-- 
2.21.3

_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing
  2020-09-10  6:17 [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
                   ` (2 preceding siblings ...)
  2020-09-10  6:17 ` [Cocci] [PATCH v2 3/3] parsing_c: type_annoter_c: Use name cache for type annotation Jaskaran Singh
@ 2020-09-10  6:19 ` Jaskaran Singh
  2020-09-15 19:23 ` Julia Lawall
  4 siblings, 0 replies; 6+ messages in thread
From: Jaskaran Singh @ 2020-09-10  6:19 UTC (permalink / raw)
  To: cocci

On Thu, 2020-09-10 at 11:47 +0530, Jaskaran Singh wrote:
> This patch series aims to optimize performance for recursively
> parsing
> header files in Coccinelle.
> 
> Coccinelle's C parsing subsystem has an option called --recursive-
> includes
> to recursively parse header files. This is used for type
> inference/annotation.
> 
> Previously, using --recursive-includes on the entire Linux kernel
> source
> code would take far too long. On my computer with the following
> specs,
> 	- Processor: AMD Ryzen 5 3550H
> 	- RAM: 8 GB
> it would take close to 7 hours to complete.  The optimization that
> this
> patch series implements reduces that time to 1 hour.
> 
> The following is a high-level description of what has been
> implemented:
> - As header files are recursively parsed, they are scanned for the
>   following:
> 	- fields of structs/unions/enums
> 	- typedefs
> 	- function prototypes
> 	- global variables
>   The names of the above are stored in a "name cache", i.e. a
> hashtable to
>   map the name to the files it is declared in.
> - A dependency graph is built to determine dependencies between all
> the
>   files in the codebase.
> - In the type annotation phase of the C subsystem, if a function
> call,
>   struct/union field or identifier is encountered, the type of which
> is
>   not known to the annoter, the name cache is checked for the name.
> - The name cache gives a list of files that the name is
> declared/defined
>   in.  These files are cross checked with the dependency graph to
>   determine if any of these are reachable by the file that the
> annoter is
>   working on.
> - If a reachable header file is found, that file is parsed and the
> type
>   associated to the name is returned.
> 
> Different approaches that were attempted to alleviate this issue, and
> the
> problems with each are as follows:
> - Caching the most recently used files: A LRU cache to store ASTs of
> the
>   most recently encountered header files. The problem with this
> approach
>   is the amount of memory it takes to cache the header file ASTs.
> - Caching the most troublesome files: A pseudo-LFU cache to store
> files
>   that cumulatively take the longest to parse, and thus bloat the
> time
>   taken. The problem with this approach is the amount of memory it
> takes
>   to cache the header file ASTs.
> - Skipping unparsable locations in header files: Skipping top-level
> items
>   in a header file that cannot be parsed. This approach does not
> produce
>   even close to the amount of optimization needed.
> 
> The next step from here would be:
> - Maintain a small but persistent cache of header files in groups of
>   directories. Leverage multiprocessing for parsing these header
> files.
> - Leverage multiprocessing to parse header files initially for name
>   extraction.
> - Performing some initial matching with the semantic patch to
> determine if
>   a C file matches. If matches are found, call the annoter and
> recursively
>   parse header files for type annotation.
> - Recursively parse all header files only once and build a large type
>   environment. Use the dependency graph to determine reachability.
> This
>   has potential memory usage issues though.
> 
> 
> Changes in v2:
> --------------
> - Change occurences of 'begin' and 'match' on the same line with
> something else
>   to the next line for better readability.
> 
> 
>  Makefile                     |    2 
>  parsing_c/includes_cache.ml  |  286
> +++++++++++++++++++++++++++++++++++++++++++
>  parsing_c/includes_cache.mli |   47 +++++++
>  parsing_c/parse_c.ml         |   27 +++-
>  parsing_c/type_annoter_c.ml  |  130 ++++++++++++++++---
>  5 files changed, 466 insertions(+), 26 deletions(-)
> 

Yikes, the diffstat is the old one. Here's the latest (not very
different)

 Makefile                     |    2 
 parsing_c/includes_cache.ml  |  290
+++++++++++++++++++++++++++++++++++++++++++
 parsing_c/includes_cache.mli |   47 ++++++
 parsing_c/parse_c.ml         |   28 +++-
 parsing_c/type_annoter_c.ml  |  134 +++++++++++++++++--
 5 files changed, 475 insertions(+), 26 deletions(-)


> 

_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing
  2020-09-10  6:17 [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
                   ` (3 preceding siblings ...)
  2020-09-10  6:19 ` [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
@ 2020-09-15 19:23 ` Julia Lawall
  4 siblings, 0 replies; 6+ messages in thread
From: Julia Lawall @ 2020-09-15 19:23 UTC (permalink / raw)
  To: Jaskaran Singh; +Cc: cocci



On Thu, 10 Sep 2020, Jaskaran Singh wrote:

> This patch series aims to optimize performance for recursively parsing
> header files in Coccinelle.
>
> Coccinelle's C parsing subsystem has an option called --recursive-includes
> to recursively parse header files. This is used for type
> inference/annotation.
>
> Previously, using --recursive-includes on the entire Linux kernel source
> code would take far too long. On my computer with the following specs,
> 	- Processor: AMD Ryzen 5 3550H
> 	- RAM: 8 GB
> it would take close to 7 hours to complete.  The optimization that this
> patch series implements reduces that time to 1 hour.
>
> The following is a high-level description of what has been implemented:
> - As header files are recursively parsed, they are scanned for the
>   following:
> 	- fields of structs/unions/enums
> 	- typedefs
> 	- function prototypes
> 	- global variables
>   The names of the above are stored in a "name cache", i.e. a hashtable to
>   map the name to the files it is declared in.
> - A dependency graph is built to determine dependencies between all the
>   files in the codebase.
> - In the type annotation phase of the C subsystem, if a function call,
>   struct/union field or identifier is encountered, the type of which is
>   not known to the annoter, the name cache is checked for the name.
> - The name cache gives a list of files that the name is declared/defined
>   in.  These files are cross checked with the dependency graph to
>   determine if any of these are reachable by the file that the annoter is
>   working on.
> - If a reachable header file is found, that file is parsed and the type
>   associated to the name is returned.
>
> Different approaches that were attempted to alleviate this issue, and the
> problems with each are as follows:
> - Caching the most recently used files: A LRU cache to store ASTs of the
>   most recently encountered header files. The problem with this approach
>   is the amount of memory it takes to cache the header file ASTs.
> - Caching the most troublesome files: A pseudo-LFU cache to store files
>   that cumulatively take the longest to parse, and thus bloat the time
>   taken. The problem with this approach is the amount of memory it takes
>   to cache the header file ASTs.
> - Skipping unparsable locations in header files: Skipping top-level items
>   in a header file that cannot be parsed. This approach does not produce
>   even close to the amount of optimization needed.
>
> The next step from here would be:
> - Maintain a small but persistent cache of header files in groups of
>   directories. Leverage multiprocessing for parsing these header files.
> - Leverage multiprocessing to parse header files initially for name
>   extraction.
> - Performing some initial matching with the semantic patch to determine if
>   a C file matches. If matches are found, call the annoter and recursively
>   parse header files for type annotation.
> - Recursively parse all header files only once and build a large type
>   environment. Use the dependency graph to determine reachability. This
>   has potential memory usage issues though.

Applied.


>
>
> Changes in v2:
> --------------
> - Change occurences of 'begin' and 'match' on the same line with something else
>   to the next line for better readability.
>
>
>  Makefile                     |    2
>  parsing_c/includes_cache.ml  |  286 +++++++++++++++++++++++++++++++++++++++++++
>  parsing_c/includes_cache.mli |   47 +++++++
>  parsing_c/parse_c.ml         |   27 +++-
>  parsing_c/type_annoter_c.ml  |  130 ++++++++++++++++---
>  5 files changed, 466 insertions(+), 26 deletions(-)
>
>
>
_______________________________________________
Cocci mailing list
Cocci@systeme.lip6.fr
https://systeme.lip6.fr/mailman/listinfo/cocci

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-09-15 19:24 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10  6:17 [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
2020-09-10  6:17 ` [Cocci] [PATCH v2 1/3] parsing_c: includes_cache: Implement a name cache Jaskaran Singh
2020-09-10  6:17 ` [Cocci] [PATCH v2 2/3] parsing_c: parse_c: Build name cache and includes dependency graph Jaskaran Singh
2020-09-10  6:17 ` [Cocci] [PATCH v2 3/3] parsing_c: type_annoter_c: Use name cache for type annotation Jaskaran Singh
2020-09-10  6:19 ` [Cocci] [PATCH v2 0/3] parsing_c: Optimize recursive header file parsing Jaskaran Singh
2020-09-15 19:23 ` Julia Lawall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).