* [PATCH v4 00/14] New hash table implementation
@ 2013-11-07 14:32 Karsten Blees
  2013-11-07 14:33 ` [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it Karsten Blees
                   ` (13 more replies)
  0 siblings, 14 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:32 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

Also here:
https://github.com/kblees/git/commits/kb/hashmap-v4

Sorry for the delay, but the promised static use-after-free analysis [1] turned out to be much more involved than I had anticipated.

The Good News is that the static analysis independently found the four use-after-free issues that we already knew about (two fixed in my initial patch 10, and two more found by Thomas' valgrind tests). These are all fixed in this round.

Valgrind tests ran overnight and only reported a single unrelated problem (glibc strlen() working in 4-byte chunks and thus reading past the malloc'ed space in builtin/fetch.c:462).

The Bad News is that I'm a bit clueless regarding unpack-trees.c, so I'd welcome it if someone more familiar with that part of git could have a look (see analysis below).


The first 11 patches in this version are exactly as currently in pu, just rebased to next. The problematic memory-leaks patch is at the very end, so we could merge the rest and cook this one a bit longer...

Changes since v3:
- included Jens Lehmann's submodule use-after-free fix [2] (#01)
- added minor fix for 'git update-index --verbose' output (#12)
- added minor cleanup patch for builtin/update-index.c (#13)
- moved the problematic 'fix memory leaks' patch to the end (#14)
- included fix for first valgrind breakage [3] (already in pu) (#14)
- added fix for second valgrind breakage reported in [4] (#14)

[1] http://article.gmane.org/gmane.comp.version-control.git/236467
[2] http://article.gmane.org/gmane.comp.version-control.git/236370
[3] http://article.gmane.org/gmane.comp.version-control.git/236468
[4] http://article.gmane.org/gmane.comp.version-control.git/236869

Jens Lehmann (1):
  submodule: don't access the .gitmodules cache entry after removing it

Karsten Blees (13):
  add a hashtable implementation that supports O(1) removal
  builtin/describe.c: use new hash map implementation
  diffcore-rename.c: move code around to prepare for the next patch
  diffcore-rename.c: simplify finding exact renames
  diffcore-rename.c: use new hash map implementation
  name-hash.c: use new hash map implementation for directories
  name-hash.c: remove unreferenced directory entries
  name-hash.c: use new hash map implementation for cache entries
  name-hash.c: remove cache entries instead of marking them CE_UNHASHED
  remove old hash.[ch] implementation
  fix 'git update-index --verbose --again' output
  builtin/update-index.c: cleanup update_one
  read-cache.c: fix memory leaks caused by removed cache entries

 Documentation/technical/api-hash.txt    |  52 -------
 Documentation/technical/api-hashmap.txt | 237 +++++++++++++++++++++++++++++
 Makefile                                |   5 +-
 builtin/describe.c                      |  53 +++----
 builtin/rm.c                            |   2 +-
 builtin/update-index.c                  |  39 +++--
 cache.h                                 |  18 +--
 diffcore-rename.c                       | 185 ++++++++---------------
 hash.c                                  | 110 --------------
 hash.h                                  |  50 -------
 hashmap.c                               | 212 ++++++++++++++++++++++++++
 hashmap.h                               |  72 +++++++++
 name-hash.c                             | 156 +++++++------------
 read-cache.c                            |  10 +-
 resolve-undo.c                          |   7 +-
 submodule.c                             |  25 +---
 t/t0011-hashmap.sh                      | 236 +++++++++++++++++++++++++++++
 test-hashmap.c                          | 256 ++++++++++++++++++++++++++++++++
 unpack-trees.c                          |   3 +-
 19 files changed, 1199 insertions(+), 529 deletions(-)
 delete mode 100644 Documentation/technical/api-hash.txt
 create mode 100644 Documentation/technical/api-hashmap.txt
 delete mode 100644 hash.c
 delete mode 100644 hash.h
 create mode 100644 hashmap.c
 create mode 100644 hashmap.h
 create mode 100755 t/t0011-hashmap.sh
 create mode 100644 test-hashmap.c

===

Potential use-after-free problems in unpack-trees.c:
----------------------------------------------------

Basic analysis of unpack-trees.c is that most callers go through add_entry or merged_entry, which both add a copy of the cache_entry to a separate index_state (unpack_trees_options.result). This may replace and free entries in unpack_trees_options.result, which is (mostly) ok because unpack_trees_options.result is (mostly) not used otherwise. Adding copies to a separate index is completely safe, so we don't even have to look at most of the other functions in unpack-trees.c.

The two call hierarchies that I'm worried about are as follows:

1.  unpack_callback may pass a cache_entry from unpack_trees_options.src_index
    to unpack_nondirectories (as src[0]), which _may_ add it to
    unpack_trees_options.result without making a copy.
1a. adding a cache_entry from src_index to result without making a copy
    destroys hashtable chaining in src_index (do_add_entry clears ce->next).
1b. if this cache_entry is later replaced and freed in result, it will
    implicitly free an entry that's still in src_index.

  read-cache.c::add_index_entry
    unpack-trees.c::do_add_entry
      unpack-trees.c::unpack_nondirectories
        unpack-trees.c::unpack_callback
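
To make 1a/1b more concrete, here is a minimal, self-contained toy model
of the suspected pattern (plain C, not actual git code; the struct and
variable names are made up):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct toy_entry {
      struct toy_entry *next;    /* hash chain, like cache_entry.next */
      char name[16];
  };

  int main(void)
  {
      /* "src_index" hash chain: ce -> other */
      struct toy_entry *other = calloc(1, sizeof(*other));
      struct toy_entry *ce = calloc(1, sizeof(*ce));
      strcpy(other->name, "b.txt");
      strcpy(ce->name, "a.txt");
      ce->next = other;

      /* 1a: adding ce to "result" without a copy clears ce->next,
       * so "other" silently drops out of src_index's chain */
      struct toy_entry *result_entry = ce;
      ce->next = NULL;

      /* 1b: "result" later replaces the entry and frees it, while
       * src_index still references the same memory */
      free(result_entry);

      printf("%s\n", ce->name);    /* use-after-free */
      return 0;
  }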

2. unpack_trees iterates over unpack_trees_options.result and uses the
   cache_entry after calling verify_absent / apply_sparse_checkout, which in
   turn _may_ replace and free that cache_entry.

  read-cache.c::add_index_entry
    unpack-trees.c::do_add_entry
      unpack-trees.c::add_entry
        unpack-trees.c::verify_clean_subdirectory
          unpack-trees.c::check_ok_to_remove
            unpack-trees.c::verify_absent_1
              unpack-trees.c::verify_absent_sparse
                unpack-trees.c::apply_sparse_checkout
                  unpack-trees.c::unpack_trees
              unpack-trees.c::verify_absent
                unpack-trees.c::unpack_trees


Static use-after-free analysis:
-------------------------------

What I did so far is this:

1. Get the reverse call hierarchy for remove_name_hash, up to cmd_* (thanks
   to Eclipse) - results in ~180 functions.
2. Get functions that reference struct cache_entry - results in ~200
   functions.
3. Intersect the two lists - results in 54 functions that _may_ use
   cache_entry after removing / freeing it (a scripted sketch of this step
   follows the list below):

ok builtin/apply.c::add_conflicted_stages_file
ok builtin/apply.c::add_index_file
ok builtin/apply.c::build_fake_ancestor
ok builtin/blame.c::fake_working_tree_commit
ok builtin/checkout.c::checkout_paths
ok builtin/checkout.c::update_some
ok builtin/commit.c::list_paths
ok builtin/ls-files.c::overlay_tree_on_cache
ok builtin/merge.c::suggest_conflicts
ok builtin/reset.c::update_index_from_diff
not-ok builtin/rm.c::cmd_rm (ce->name added to list without strdup)
ok builtin/update-index.c::add_cacheinfo
ok builtin/update-index.c::add_one_path
not-ok builtin/update-index.c::do_reupdate (ce->name passed to update_one,
   which uses it after remove_file_from_cache)
ok builtin/update-index.c::process_directory
ok builtin/update-index.c::process_path
ok builtin/update-index.c::unresolve_one
ok merge-recursive.c::add_cacheinfo
ok read-cache.c::add_index_entry
ok read-cache.c::add_index_entry_with_check
ok read-cache.c::add_to_index
ok read-cache.c::check_file_directory_conflict (called from add_index_entry,
   so ce is not yet in the index)
ok read-cache.c::has_dir_name (called from add_index_entry, so ce is not yet
   in the index)
ok read-cache.c::has_file_name (called from add_index_entry, so ce is not yet
   in the index)
not-ok read-cache.c::read_index_unmerged (uses old ce after replacing
   it with new_ce)
ok read-cache.c::refresh_index
ok read-cache.c::rename_index_entry_at
ok resolve-undo.c::unmerge_index
not-ok resolve-undo.c::unmerge_index_entry_at (uses ce->name after
   remove_index_entry_at)
ok resolve-undo.c::unmerge_marked_index
ok sequencer.c::do_recursive_merge
ok tree.c::read_one_entry_opt
ok tree.c::read_tree
ok unpack-trees.c::add_entry (adds a copy to unpack_trees_options.result)
-- unpack-trees.c::add_same_unmerged
?? unpack-trees.c::apply_sparse_checkout
-- unpack-trees.c::bind_merge
?? unpack-trees.c::check_ok_to_remove
ok unpack-trees.c::check_updates (ok - holds no cache_entry references
   while calling remove_marked_cache_entries)
-- unpack-trees.c::deleted_entry
ok unpack-trees.c::do_add_entry (adds ce to unpack_trees_options.result)
-- unpack-trees.c::keep_entry
ok unpack-trees.c::merged_entry (adds a copy to unpack_trees_options.result)
-- unpack-trees.c::oneway_merge
-- unpack-trees.c::threeway_merge
-- unpack-trees.c::twoway_merge
?? unpack-trees.c::unpack_callback (see unpack_nondirectories)
-- unpack-trees.c::unpack_index_entry
?? unpack-trees.c::unpack_nondirectories (may call do_add_entry without
   copying src[0])
?? unpack-trees.c::unpack_trees (may replace entries in
   unpack_trees_options.result while iterating over it)
?? unpack-trees.c::verify_absent (see unpack_trees)
?? unpack-trees.c::verify_absent_1 (see unpack_trees)
?? unpack-trees.c::verify_absent_sparse (see unpack_trees)
?? unpack-trees.c::verify_clean_subdirectory (see unpack_trees)
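
The intersection step (3.) itself is trivial to script once the two lists
exist. Roughly, assuming hand/IDE-generated files callers.txt and users.txt
with one function name per line (these file names are made up):

  # callers.txt: ~180 functions reaching remove_name_hash (from Eclipse)
  # users.txt:   ~200 functions referencing struct cache_entry
  sort -u callers.txt >callers.sorted
  sort -u users.txt >users.sorted
  comm -12 callers.sorted users.sorted   # -> the 54 candidates listed above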


* [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
@ 2013-11-07 14:33 ` Karsten Blees
  2013-11-07 22:27   ` Heiko Voigt
  2013-11-07 14:34 ` [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal Karsten Blees
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:33 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

Commit 5fee995244e introduced the stage_updated_gitmodules() function to
add submodule configuration updates to the index. It assumed that the same
cache entry would still be valid even after calling remove_cache_entry_at().
This was true in the old days, as cache entries could never be freed, but
that no longer holds: there is ongoing work to free removed cache entries,
which makes this code segfault.

Fix that by calling add_file_to_cache() instead of open-coding it. Also
remove the "could not find .gitmodules in index" warning, as that won't
happen in regular use cases (and by just silently adding the file to the
index we do the right thing).

Thanks-to: Karsten Blees <karsten.blees@gmail.com>
Signed-off-by: Jens Lehmann <Jens.Lehmann@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 submodule.c | 25 +------------------------
 1 file changed, 1 insertion(+), 24 deletions(-)

diff --git a/submodule.c b/submodule.c
index 1905d75..e388487 100644
--- a/submodule.c
+++ b/submodule.c
@@ -116,30 +116,7 @@ int remove_path_from_gitmodules(const char *path)
 
 void stage_updated_gitmodules(void)
 {
-	struct strbuf buf = STRBUF_INIT;
-	struct stat st;
-	int pos;
-	struct cache_entry *ce;
-	int namelen = strlen(".gitmodules");
-
-	pos = cache_name_pos(".gitmodules", namelen);
-	if (pos < 0) {
-		warning(_("could not find .gitmodules in index"));
-		return;
-	}
-	ce = active_cache[pos];
-	ce->ce_flags = namelen;
-	if (strbuf_read_file(&buf, ".gitmodules", 0) < 0)
-		die(_("reading updated .gitmodules failed"));
-	if (lstat(".gitmodules", &st) < 0)
-		die_errno(_("unable to stat updated .gitmodules"));
-	fill_stat_cache_info(ce, &st);
-	ce->ce_mode = ce_mode_from_stat(ce, st.st_mode);
-	if (remove_cache_entry_at(pos) < 0)
-		die(_("unable to remove .gitmodules from index"));
-	if (write_sha1_file(buf.buf, buf.len, blob_type, ce->sha1))
-		die(_("adding updated .gitmodules failed"));
-	if (add_cache_entry(ce, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE))
+	if (add_file_to_cache(".gitmodules", 0))
 		die(_("staging updated .gitmodules failed"));
 }
 
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
  2013-11-07 14:33 ` [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it Karsten Blees
@ 2013-11-07 14:34 ` Karsten Blees
  2013-11-07 21:40   ` Junio C Hamano
  2013-11-07 14:35 ` [PATCH v4 03/14] builtin/describe.c: use new hash map implementation Karsten Blees
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:34 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

The existing hashtable implementation (in hash.[ch]) uses open addressing,
i.e. it resolves hash collisions by spreading entries across the table.
Thus, removal is difficult to implement with less than O(n) complexity.
Resolving collisions of entries with identical hashes (e.g. via chaining)
is left to the client code.

Add a hashtable implementation that supports O(1) removal and is slightly
easier to use due to builtin entry chaining.

It supports all basic operations: init, free, get, add, remove and iteration.

Also includes ready-to-use hash functions based on the public domain FNV-1
algorithm (http://www.isthe.com/chongo/tech/comp/fnv).

The per-entry data structure (hashmap_entry) is piggybacked in front of
the client's data structure to save memory. See test-hashmap.c for usage
examples.

The hashtable is resized by a factor of four when 80% full. With these
settings, average memory consumption is about 2/3 of hash.[ch], and
insertion is about twice as fast due to less frequent resizing.

Lookups are also slightly faster, because entries are strictly confined to
their bucket (i.e. no data of other buckets needs to be traversed).
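
As a worked example (the numbers come straight from the constants in
hashmap.c and the grow/shrink test in t0011-hashmap.sh): starting from the
initial table size of 64, adding the 52nd entry makes 52 * 1.25 exceed 64,
so the table grows by a factor of four to 256; removing entries until only
42 remain makes 42 * 6 drop below 256, so the table shrinks back to 64.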

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/api-hashmap.txt | 237 ++++++++++++++++++++++
 Makefile                                |   3 +
 hashmap.c                               | 212 ++++++++++++++++++++
 hashmap.h                               |  72 +++++++
 t/t0011-hashmap.sh                      | 236 ++++++++++++++++++++++
 test-hashmap.c                          | 340 ++++++++++++++++++++++++++++++++
 6 files changed, 1100 insertions(+)
 create mode 100644 Documentation/technical/api-hashmap.txt
 create mode 100644 hashmap.c
 create mode 100644 hashmap.h
 create mode 100755 t/t0011-hashmap.sh
 create mode 100644 test-hashmap.c

diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt
new file mode 100644
index 0000000..fc46e56
--- /dev/null
+++ b/Documentation/technical/api-hashmap.txt
@@ -0,0 +1,237 @@
+hashmap API
+===========
+
+The hashmap API is a generic implementation of hash-based key-value mappings.
+
+Data Structures
+---------------
+
+`struct hashmap`::
+
+	The hash table structure.
++
+The `size` member keeps track of the total number of entries. The `cmpfn`
+member is a function used to compare two entries for equality. The `table` and
+`tablesize` members store the hash table and its size, respectively.
+
+`struct hashmap_entry`::
+
+	An opaque structure representing an entry in the hash table, which must
+	be used as first member of user data structures. Ideally it should be
+	followed by an int-sized member to prevent unused memory on 64-bit
+	systems due to alignment.
++
+The `hash` member is the entry's hash code and the `next` member points to the
+next entry in case of collisions (i.e. if multiple entries map to the same
+bucket).
+
+`struct hashmap_iter`::
+
+	An iterator structure, to be used with hashmap_iter_* functions.
+
+Types
+-----
+
+`int (*hashmap_cmp_fn)(const void *entry, const void *entry_or_key, const void *keydata)`::
+
+	User-supplied function to test two hashmap entries for equality. Shall
+	return 0 if the entries are equal.
++
+This function is always called with non-NULL `entry` / `entry_or_key`
+parameters that have the same hash code. When looking up an entry, the `key`
+and `keydata` parameters to hashmap_get and hashmap_remove are always passed
+as second and third argument, respectively. Otherwise, `keydata` is NULL.
+
+`void (*hashmap_free_fn)(void *entry)`::
+
+	User-supplied function to free a hashmap_entry. If entries are simply
+	malloc'ed memory, use stdlib's free() function.
+
+Functions
+---------
+
+`unsigned int strhash(const char *buf)`::
+`unsigned int strihash(const char *buf)`::
+`unsigned int memhash(const void *buf, size_t len)`::
+`unsigned int memihash(const void *buf, size_t len)`::
+
+	Ready-to-use hash functions for strings, using the FNV-1 algorithm (see
+	http://www.isthe.com/chongo/tech/comp/fnv).
++
+`strhash` and `strihash` take 0-terminated strings, while `memhash` and
+`memihash` operate on arbitrary-length memory.
++
+`strihash` and `memihash` are case insensitive versions.
+
+`void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function, size_t initial_size)`::
+
+	Initializes a hashmap structure.
++
+`map` is the hashmap to initialize.
++
+The `equals_function` can be specified to compare two entries for equality.
+If NULL, entries are considered equal if their hash codes are equal.
++
+If the total number of entries is known in advance, the `initial_size`
+parameter may be used to preallocate a sufficiently large table and thus
+prevent expensive resizing. If 0, the table is dynamically resized.
+
+`void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)`::
+
+	Frees a hashmap structure and allocated memory.
++
+`map` is the hashmap to free.
++
+If `free_function` is specified (not NULL), it is called to free each
+hashmap_entry in the map. If entries are simply malloc'ed, use stdlib's free().
+
+`void hashmap_entry_init(void *entry, int hash)`::
+
+	Initializes a hashmap_entry structure.
++
+`entry` points to the entry to initialize.
++
+`hash` is the hash code of the entry.
+
+`void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)`::
+
+	Returns the hashmap entry for the specified key, or NULL if not found.
++
+`map` is the hashmap structure.
++
+`key` is a hashmap_entry structure (or user data structure that starts with
+hashmap_entry) that has at least been initialized with the proper hash code
+(via `hashmap_entry_init`).
++
+If an entry with matching hash code is found, `key` and `keydata` are passed
+to `hashmap_cmp_fn` to decide whether the entry matches the key.
+
+`void *hashmap_get_next(const struct hashmap *map, const void *entry)`::
+
+	Returns the next equal hashmap entry, or NULL if not found. This can be
+	used to iterate over duplicate entries (see `hashmap_add`).
++
+`map` is the hashmap structure.
++
+`entry` is the hashmap_entry to start the search from, obtained via a previous
+call to `hashmap_get` or `hashmap_get_next`.
+
+`void hashmap_add(struct hashmap *map, void *entry)`::
+
+	Adds a hashmap entry. This allows adding duplicate entries (i.e.
+	separate values with the same key according to hashmap_cmp_fn).
++
+`map` is the hashmap structure.
++
+`entry` is the entry to add.
+
+`void *hashmap_put(struct hashmap *map, void *entry)`::
+
+	Adds or replaces a hashmap entry.
++
+`map` is the hashmap structure.
++
+`entry` is the entry to add or update.
++
+Returns the previous entry, or NULL if not found (i.e. the entry was added).
+
+`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`::
+
+	Removes a hashmap entry matching the specified key.
++
+`map` is the hashmap structure.
++
+`key` is a hashmap_entry structure (or user data structure that starts with
+hashmap_entry) that has at least been initialized with the proper hash code
+(via `hashmap_entry_init`).
++
+If an entry with matching hash code is found, `key` and `keydata` are
+passed to `hashmap_cmp_fn` to decide whether the entry matches the key.
++
+Returns the removed entry, or NULL if not found.
+
+`void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter)`::
+`void *hashmap_iter_next(struct hashmap_iter *iter)`::
+`void *hashmap_iter_first(struct hashmap *map, struct hashmap_iter *iter)`::
+
+	Used to iterate over all entries of a hashmap.
++
+`hashmap_iter_init` initializes a `hashmap_iter` structure.
++
+`hashmap_iter_next` returns the next hashmap_entry, or NULL if there are no
+more entries.
++
+`hashmap_iter_first` is a combination of both (i.e. initializes the iterator
+and returns the first entry, if any).
+
+Usage example
+-------------
+
+Here's a simple usage example that maps long keys to double values.
+[source,c]
+------------
+struct hashmap map;
+
+struct long2double {
+	struct hashmap_entry ent; /* must be the first member! */
+	long key;
+	double value;
+};
+
+static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused)
+{
+	return !(e1->key == e2->key);
+}
+
+void long2double_init()
+{
+	hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
+}
+
+void long2double_free()
+{
+	hashmap_free(&map, free);
+}
+
+static struct long2double *find_entry(long key)
+{
+	struct long2double k;
+	hashmap_entry_init(&k, memhash(&key, sizeof(long)));
+	k.key = key;
+	return hashmap_get(&map, &k, NULL);
+}
+
+double get_value(long key)
+{
+	struct long2double *e = find_entry(key);
+	return e ? e->value : 0;
+}
+
+void set_value(long key, double value)
+{
+	struct long2double *e = find_entry(key);
+	if (!e) {
+		e = malloc(sizeof(struct long2double));
+		hashmap_entry_init(e, memhash(&key, sizeof(long)));
+		e->key = key;
+		hashmap_add(&map, e);
+	}
+	e->value = value;
+}
+------------
+
+Using variable-sized keys
+-------------------------
+
+The `hashmap_get` and `hashmap_remove` functions expect an ordinary
+`hashmap_entry` structure as key to find the correct entry. If the key data is
+variable-sized (e.g. a FLEX_ARRAY string) or quite large, it is undesirable
+to create a full-fledged entry structure on the heap and copy all the key data
+into the structure.
+
+In this case, the `keydata` parameter can be used to pass
+variable-sized key data directly to the comparison function, and the `key`
+parameter can be a stripped-down, fixed size entry structure allocated on the
+stack.
+
+See test-hashmap.c for an example using arbitrary-length strings as keys.
diff --git a/Makefile b/Makefile
index af847f8..05c5b4d 100644
--- a/Makefile
+++ b/Makefile
@@ -556,6 +556,7 @@ TEST_PROGRAMS_NEED_X += test-date
 TEST_PROGRAMS_NEED_X += test-delta
 TEST_PROGRAMS_NEED_X += test-dump-cache-tree
 TEST_PROGRAMS_NEED_X += test-genrandom
+TEST_PROGRAMS_NEED_X += test-hashmap
 TEST_PROGRAMS_NEED_X += test-index-version
 TEST_PROGRAMS_NEED_X += test-line-buffer
 TEST_PROGRAMS_NEED_X += test-match-trees
@@ -676,6 +677,7 @@ LIB_H += gpg-interface.h
 LIB_H += graph.h
 LIB_H += grep.h
 LIB_H += hash.h
+LIB_H += hashmap.h
 LIB_H += help.h
 LIB_H += http.h
 LIB_H += kwset.h
@@ -807,6 +809,7 @@ LIB_OBJS += gpg-interface.o
 LIB_OBJS += graph.o
 LIB_OBJS += grep.o
 LIB_OBJS += hash.o
+LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
 LIB_OBJS += ident.o
diff --git a/hashmap.c b/hashmap.c
new file mode 100644
index 0000000..3a9f8c1
--- /dev/null
+++ b/hashmap.c
@@ -0,0 +1,212 @@
+/*
+ * Generic implementation of hash-based key value mappings.
+ */
+#include "cache.h"
+#include "hashmap.h"
+
+#define FNV32_BASE ((unsigned int) 0x811c9dc5)
+#define FNV32_PRIME ((unsigned int) 0x01000193)
+
+unsigned int strhash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++))
+		hash = (hash * FNV32_PRIME) ^ c;
+	return hash;
+}
+
+unsigned int strihash(const char *str)
+{
+	unsigned int c, hash = FNV32_BASE;
+	while ((c = (unsigned char) *str++)) {
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memhash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char*) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+unsigned int memihash(const void *buf, size_t len)
+{
+	unsigned int hash = FNV32_BASE;
+	unsigned char *ucbuf = (unsigned char*) buf;
+	while (len--) {
+		unsigned int c = *ucbuf++;
+		if (c >= 'a' && c <= 'z')
+			c -= 'a' - 'A';
+		hash = (hash * FNV32_PRIME) ^ c;
+	}
+	return hash;
+}
+
+#define HASHMAP_INITIAL_SIZE 64
+/* grow / shrink by 2^2 */
+#define HASHMAP_GROW 2
+/* grow if > 80% full (to 20%) */
+#define HASHMAP_GROW_AT 1.25
+/* shrink if < 16.6% full (to 66.6%) */
+#define HASHMAP_SHRINK_AT 6
+
+static inline int entry_equals(const struct hashmap *map,
+		const struct hashmap_entry *e1, const struct hashmap_entry *e2,
+		const void *keydata)
+{
+	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, keydata));
+}
+
+static inline unsigned int bucket(const struct hashmap *map,
+		const struct hashmap_entry *key)
+{
+	return key->hash & (map->tablesize - 1);
+}
+
+static void rehash(struct hashmap *map, unsigned int newsize)
+{
+	unsigned int i, oldsize = map->tablesize;
+	struct hashmap_entry **oldtable = map->table;
+
+	map->tablesize = newsize;
+	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
+	for (i = 0; i < oldsize; i++) {
+		struct hashmap_entry *e = oldtable[i];
+		while (e) {
+			struct hashmap_entry *next = e->next;
+			unsigned int b = bucket(map, e);
+			e->next = map->table[b];
+			map->table[b] = e;
+			e = next;
+		}
+	}
+	free(oldtable);
+}
+
+static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
+		const struct hashmap_entry *key, const void *keydata)
+{
+	struct hashmap_entry **e = &map->table[bucket(map, key)];
+	while (*e && !entry_equals(map, *e, key, keydata))
+		e = &(*e)->next;
+	return e;
+}
+
+static int always_equal(const void *unused1, const void *unused2, const void *unused3)
+{
+	return 0;
+}
+
+void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
+		size_t initial_size)
+{
+	map->size = 0;
+	map->cmpfn = equals_function ? equals_function : always_equal;
+	/* calculate initial table size and allocate the table */
+	map->tablesize = HASHMAP_INITIAL_SIZE;
+	initial_size *= HASHMAP_GROW_AT;
+	while (initial_size > map->tablesize)
+		map->tablesize <<= HASHMAP_GROW;
+	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
+}
+
+void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
+{
+	if (!map || !map->table)
+		return;
+	if (free_function) {
+		struct hashmap_iter iter;
+		struct hashmap_entry *e;
+		hashmap_iter_init(map, &iter);
+		while ((e = hashmap_iter_next(&iter)))
+			(*free_function)(e);
+	}
+	free(map->table);
+	memset(map, 0, sizeof(*map));
+}
+
+void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)
+{
+	return *find_entry_ptr(map, key, keydata);
+}
+
+void *hashmap_get_next(const struct hashmap *map, const void *entry)
+{
+	struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
+	for (; e; e = e->next)
+		if (entry_equals(map, entry, e, NULL))
+			return e;
+	return NULL;
+}
+
+void hashmap_add(struct hashmap *map, void *entry)
+{
+	unsigned int b = bucket(map, entry);
+
+	/* add entry */
+	((struct hashmap_entry*) entry)->next = map->table[b];
+	map->table[b] = entry;
+
+	/* fix size and rehash if appropriate */
+	map->size++;
+	if (map->size * HASHMAP_GROW_AT > map->tablesize)
+		rehash(map, map->tablesize << HASHMAP_GROW);
+}
+
+void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)
+{
+	struct hashmap_entry *old;
+	struct hashmap_entry **e = find_entry_ptr(map, key, keydata);
+	if (!*e)
+		return NULL;
+
+	/* remove existing entry */
+	old = *e;
+	*e = old->next;
+	old->next = NULL;
+
+	/* fix size and rehash if appropriate */
+	map->size--;
+	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
+	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
+		rehash(map, map->tablesize >> HASHMAP_GROW);
+	return old;
+}
+
+void *hashmap_put(struct hashmap *map, void *entry)
+{
+	struct hashmap_entry *old = hashmap_remove(map, entry, NULL);
+	hashmap_add(map, entry);
+	return old;
+}
+
+void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter)
+{
+	iter->map = map;
+	iter->tablepos = 0;
+	iter->next = NULL;
+}
+
+void *hashmap_iter_next(struct hashmap_iter *iter)
+{
+	struct hashmap_entry *current = iter->next;
+	for (;;) {
+		if (current) {
+			iter->next = current->next;
+			return current;
+		}
+
+		if (iter->tablepos >= iter->map->tablesize)
+			return NULL;
+
+		current = iter->map->table[iter->tablepos++];
+	}
+}
diff --git a/hashmap.h b/hashmap.h
new file mode 100644
index 0000000..eea003a
--- /dev/null
+++ b/hashmap.h
@@ -0,0 +1,72 @@
+#ifndef HASHMAP_H
+#define HASHMAP_H
+
+/*
+ * Generic implementation of hash-based key-value mappings.
+ * See Documentation/technical/api-hashmap.txt.
+ */
+
+/* FNV-1 functions */
+
+extern unsigned int strhash(const char *buf);
+extern unsigned int strihash(const char *buf);
+extern unsigned int memhash(const void *buf, size_t len);
+extern unsigned int memihash(const void *buf, size_t len);
+
+/* data structures */
+
+struct hashmap_entry {
+	struct hashmap_entry *next;
+	unsigned int hash;
+};
+
+typedef int (*hashmap_cmp_fn)(const void *entry, const void *entry_or_key,
+		const void *keydata);
+typedef void (*hashmap_free_fn)(void *entry);
+
+struct hashmap {
+	struct hashmap_entry **table;
+	hashmap_cmp_fn cmpfn;
+	unsigned int size, tablesize;
+};
+
+struct hashmap_iter {
+	struct hashmap *map;
+	struct hashmap_entry *next;
+	unsigned int tablepos;
+};
+
+/* hashmap functions */
+
+extern void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
+		size_t initial_size);
+extern void hashmap_free(struct hashmap *map, hashmap_free_fn free_function);
+
+/* hashmap_entry functions */
+
+static inline void hashmap_entry_init(void *entry, int hash)
+{
+	struct hashmap_entry *e = entry;
+	e->hash = hash;
+	e->next = NULL;
+}
+extern void *hashmap_get(const struct hashmap *map, const void *key,
+		const void *keydata);
+extern void *hashmap_get_next(const struct hashmap *map, const void *entry);
+extern void hashmap_add(struct hashmap *map, void *entry);
+extern void *hashmap_put(struct hashmap *map, void *entry);
+extern void *hashmap_remove(struct hashmap *map, const void *key,
+		const void *keydata);
+
+/* hashmap_iter functions */
+
+extern void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter);
+extern void *hashmap_iter_next(struct hashmap_iter *iter);
+static inline void *hashmap_iter_first(struct hashmap *map,
+		struct hashmap_iter *iter)
+{
+	hashmap_iter_init(map, iter);
+	return hashmap_iter_next(iter);
+}
+
+#endif
diff --git a/t/t0011-hashmap.sh b/t/t0011-hashmap.sh
new file mode 100755
index 0000000..6c699d5
--- /dev/null
+++ b/t/t0011-hashmap.sh
@@ -0,0 +1,236 @@
+#!/bin/sh
+
+test_description='test hashmap and string hash functions'
+. ./test-lib.sh
+
+test_hashmap() {
+	echo "$1" | test-hashmap $3 > actual &&
+	echo "$2" > expect &&
+	test_cmp expect actual
+}
+
+test_expect_success 'hash functions' '
+
+test_hashmap "hash key1" "2215982743 2215982743 116372151 116372151" &&
+test_hashmap "hash key2" "2215982740 2215982740 116372148 116372148" &&
+test_hashmap "hash fooBarFrotz" "1383912807 1383912807 3189766727 3189766727" &&
+test_hashmap "hash foobarfrotz" "2862305959 2862305959 3189766727 3189766727"
+
+'
+
+test_expect_success 'put' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+size" "NULL
+NULL
+NULL
+NULL
+64 4"
+
+'
+
+test_expect_success 'put (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+size" "NULL
+NULL
+NULL
+64 3" ignorecase
+
+'
+
+test_expect_success 'replace' '
+
+test_hashmap "put key1 value1
+put key1 value2
+put fooBarFrotz value3
+put fooBarFrotz value4
+size" "NULL
+value1
+NULL
+value3
+64 2"
+
+'
+
+test_expect_success 'replace (case insensitive)' '
+
+test_hashmap "put key1 value1
+put Key1 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+size" "NULL
+value1
+NULL
+value3
+64 2" ignorecase
+
+'
+
+test_expect_success 'get' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+put foobarfrotz value4
+get key1
+get key2
+get fooBarFrotz
+get notInMap" "NULL
+NULL
+NULL
+NULL
+value1
+value2
+value3
+NULL"
+
+'
+
+test_expect_success 'get (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+get Key1
+get keY2
+get foobarfrotz
+get notInMap" "NULL
+NULL
+NULL
+value1
+value2
+value3
+NULL" ignorecase
+
+'
+
+test_expect_success 'add' '
+
+test_hashmap "add key1 value1
+add key1 value2
+add fooBarFrotz value3
+add fooBarFrotz value4
+get key1
+get fooBarFrotz
+get notInMap" "value2
+value1
+value4
+value3
+NULL"
+
+'
+
+test_expect_success 'add (case insensitive)' '
+
+test_hashmap "add key1 value1
+add Key1 value2
+add fooBarFrotz value3
+add foobarfrotz value4
+get key1
+get Foobarfrotz
+get notInMap" "value2
+value1
+value4
+value3
+NULL" ignorecase
+
+'
+
+test_expect_success 'remove' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+remove key1
+remove key2
+remove notInMap
+size" "NULL
+NULL
+NULL
+value1
+value2
+NULL
+64 1"
+
+'
+
+test_expect_success 'remove (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+remove Key1
+remove keY2
+remove notInMap
+size" "NULL
+NULL
+NULL
+value1
+value2
+NULL
+64 1" ignorecase
+
+'
+
+test_expect_success 'iterate' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+iterate" "NULL
+NULL
+NULL
+key2 value2
+key1 value1
+fooBarFrotz value3"
+
+'
+
+test_expect_success 'iterate (case insensitive)' '
+
+test_hashmap "put key1 value1
+put key2 value2
+put fooBarFrotz value3
+iterate" "NULL
+NULL
+NULL
+fooBarFrotz value3
+key2 value2
+key1 value1" ignorecase
+
+'
+
+test_expect_success 'grow / shrink' '
+
+	rm -f in &&
+	rm -f expect &&
+	for n in $(test_seq 51)
+	do
+		echo put key$n value$n >> in &&
+		echo NULL >> expect
+	done &&
+	echo size >> in &&
+	echo 64 51 >> expect &&
+	echo put key52 value52 >> in &&
+	echo NULL >> expect
+	echo size >> in &&
+	echo 256 52 >> expect &&
+	for n in $(test_seq 10)
+	do
+		echo remove key$n >> in &&
+		echo value$n >> expect
+	done &&
+	echo size >> in &&
+	echo 64 42 >> expect &&
+	cat in | test-hashmap > out &&
+	test_cmp expect out
+
+'
+
+test_done
diff --git a/test-hashmap.c b/test-hashmap.c
new file mode 100644
index 0000000..2d2b834
--- /dev/null
+++ b/test-hashmap.c
@@ -0,0 +1,340 @@
+#include "cache.h"
+#include "hashmap.h"
+#include <stdio.h>
+
+struct test_entry
+{
+	struct hashmap_entry ent;
+	/* key and value as two \0-terminated strings */
+	char key[FLEX_ARRAY];
+};
+
+static const char *get_value(const struct test_entry *e)
+{
+	return e->key + strlen(e->key) + 1;
+}
+
+static int test_entry_cmp(const struct test_entry *e1,
+		const struct test_entry *e2, const char* key)
+{
+	return strcmp(e1->key, key ? key : e2->key);
+}
+
+static int test_entry_cmp_icase(const struct test_entry *e1,
+		const struct test_entry *e2, const char* key)
+{
+	return strcasecmp(e1->key, key ? key : e2->key);
+}
+
+static struct test_entry *alloc_test_entry(int hash, char *key, int klen,
+		char *value, int vlen)
+{
+	struct test_entry *entry = malloc(sizeof(struct test_entry) + klen
+			+ vlen + 2);
+	hashmap_entry_init(entry, hash);
+	memcpy(entry->key, key, klen + 1);
+	memcpy(entry->key + klen + 1, value, vlen + 1);
+	return entry;
+}
+
+#define HASH_METHOD_FNV 0
+#define HASH_METHOD_I 1
+#define HASH_METHOD_IDIV10 2
+#define HASH_METHOD_0 3
+#define HASH_METHOD_X2 4
+#define TEST_SPARSE 8
+#define TEST_ADD 16
+#define TEST_SIZE 100000
+
+static unsigned int hash(unsigned int method, unsigned int i, const char *key)
+{
+	unsigned int hash;
+	switch (method & 3)
+	{
+	case HASH_METHOD_FNV:
+		hash = strhash(key);
+		break;
+	case HASH_METHOD_I:
+		hash = i;
+		break;
+	case HASH_METHOD_IDIV10:
+		hash = i / 10;
+		break;
+	case HASH_METHOD_0:
+		hash = 0;
+		break;
+	}
+
+	if (method & HASH_METHOD_X2)
+		hash = 2 * hash;
+	return hash;
+}
+
+/*
+ * Test insert performance of hashmap.[ch]
+ * Usage: time echo "perfhashmap method rounds" | test-hashmap
+ */
+static void perf_hashmap(unsigned int method, unsigned int rounds)
+{
+	struct hashmap map;
+	char buf[16];
+	struct test_entry **entries;
+	unsigned int *hashes;
+	unsigned int i, j;
+
+	entries = malloc(TEST_SIZE * sizeof(struct test_entry*));
+	hashes = malloc(TEST_SIZE * sizeof(int));
+	for (i = 0; i < TEST_SIZE; i++) {
+		snprintf(buf, sizeof(buf), "%i", i);
+		entries[i] = alloc_test_entry(0, buf, strlen(buf), "", 0);
+		hashes[i] = hash(method, i, entries[i]->key);
+	}
+
+	if (method & TEST_ADD) {
+		/* test adding to the map */
+		for (j = 0; j < rounds; j++) {
+			hashmap_init(&map, (hashmap_cmp_fn) test_entry_cmp, 0);
+
+			/* add entries */
+			for (i = 0; i < TEST_SIZE; i++) {
+				hashmap_entry_init(entries[i], hashes[i]);
+				hashmap_add(&map, entries[i]);
+			}
+
+			hashmap_free(&map, NULL);
+		}
+	} else {
+		/* test map lookups */
+		hashmap_init(&map, (hashmap_cmp_fn) test_entry_cmp, 0);
+
+		/* fill the map (sparsely if specified) */
+		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
+		for (i = 0; i < j; i++) {
+			hashmap_entry_init(entries[i], hashes[i]);
+			hashmap_add(&map, entries[i]);
+		}
+
+		for (j = 0; j < rounds; j++) {
+			for (i = 0; i < TEST_SIZE; i++) {
+				struct hashmap_entry key;
+				hashmap_entry_init(&key, hashes[i]);
+				hashmap_get(&map, &key, entries[i]->key);
+			}
+		}
+
+		hashmap_free(&map, NULL);
+	}
+}
+
+struct hash_entry
+{
+	struct hash_entry *next;
+	char key[FLEX_ARRAY];
+};
+
+/*
+ * Test insert performance of hash.[ch]
+ * Usage: time echo "perfhash method rounds" | test-hashmap
+ */
+static void perf_hash(unsigned int method, unsigned int rounds)
+{
+	struct hash_table map;
+	char buf[16];
+	struct hash_entry **entries, **res, *entry;
+	unsigned int *hashes;
+	unsigned int i, j;
+
+	entries = malloc(TEST_SIZE * sizeof(struct hash_entry*));
+	hashes = malloc(TEST_SIZE * sizeof(int));
+	for (i = 0; i < TEST_SIZE; i++) {
+		snprintf(buf, sizeof(buf), "%i", i);
+		entries[i] = malloc(sizeof(struct hash_entry) + strlen(buf) + 1);
+		strcpy(entries[i]->key, buf);
+		hashes[i] = hash(method, i, entries[i]->key);
+	}
+
+	if (method & TEST_ADD) {
+		/* test adding to the map */
+		for (j = 0; j < rounds; j++) {
+			init_hash(&map);
+
+			/* add entries */
+			for (i = 0; i < TEST_SIZE; i++) {
+				res = (struct hash_entry**) insert_hash(
+						hashes[i], entries[i], &map);
+				if (res) {
+					entries[i]->next = *res;
+					*res = entries[i];
+				} else {
+					entries[i]->next = NULL;
+				}
+			}
+
+			free_hash(&map);
+		}
+	} else {
+		/* test map lookups */
+		init_hash(&map);
+
+		/* fill the map (sparsely if specified) */
+		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
+		for (i = 0; i < j; i++) {
+			res = (struct hash_entry**) insert_hash(hashes[i],
+					entries[i], &map);
+			if (res) {
+				entries[i]->next = *res;
+				*res = entries[i];
+			} else {
+				entries[i]->next = NULL;
+			}
+		}
+
+		for (j = 0; j < rounds; j++) {
+			for (i = 0; i < TEST_SIZE; i++) {
+				entry = lookup_hash(hashes[i], &map);
+				while (entry) {
+					if (!strcmp(entries[i]->key, entry->key))
+						break;
+					entry = entry->next;
+				}
+			}
+		}
+
+		free_hash(&map);
+
+	}
+}
+
+#define DELIM " \t\r\n"
+
+/*
+ * Read stdin line by line and print result of commands to stdout:
+ *
+ * hash key -> strhash(key) memhash(key) strihash(key) memihash(key)
+ * put key value -> NULL / old value
+ * get key -> NULL / value
+ * remove key -> NULL / old value
+ * iterate -> key1 value1\nkey2 value2\n...
+ * size -> tablesize numentries
+ *
+ * perfhashmap method rounds -> test hashmap.[ch] performance
+ * perfhash method rounds -> test hash.[ch] performance
+ */
+int main(int argc, char *argv[])
+{
+	char line[1024];
+	struct hashmap map;
+	int icase;
+
+	/* init hash map */
+	icase = argc > 1 && !strcmp("ignorecase", argv[1]);
+	hashmap_init(&map, (hashmap_cmp_fn) (icase ? test_entry_cmp_icase
+			: test_entry_cmp), 0);
+
+	/* process commands from stdin */
+	while (fgets(line, sizeof(line), stdin)) {
+		char *cmd, *p1 = NULL, *p2 = NULL;
+		int l1 = 0, l2 = 0, hash = 0;
+		struct test_entry *entry;
+
+		/* break line into command and up to two parameters */
+		cmd = strtok(line, DELIM);
+		/* ignore empty lines */
+		if (!cmd || *cmd == '#')
+			continue;
+
+		p1 = strtok(NULL, DELIM);
+		if (p1) {
+			l1 = strlen(p1);
+			hash = icase ? strihash(p1) : strhash(p1);
+			p2 = strtok(NULL, DELIM);
+			if (p2)
+				l2 = strlen(p2);
+		}
+
+		if (!strcmp("hash", cmd) && l1) {
+
+			/* print results of different hash functions */
+			printf("%u %u %u %u\n", strhash(p1), memhash(p1, l1),
+					strihash(p1), memihash(p1, l1));
+
+		} else if (!strcmp("add", cmd) && l1 && l2) {
+
+			/* create entry with key = p1, value = p2 */
+			entry = alloc_test_entry(hash, p1, l1, p2, l2);
+
+			/* add to hashmap */
+			hashmap_add(&map, entry);
+
+		} else if (!strcmp("put", cmd) && l1 && l2) {
+
+			/* create entry with key = p1, value = p2 */
+			entry = alloc_test_entry(hash, p1, l1, p2, l2);
+
+			/* add / replace entry */
+			entry = hashmap_put(&map, entry);
+
+			/* print and free replaced entry, if any */
+			puts(entry ? get_value(entry) : "NULL");
+			free(entry);
+
+		} else if (!strcmp("get", cmd) && l1) {
+
+			/* setup static key */
+			struct hashmap_entry key;
+			hashmap_entry_init(&key, hash);
+
+			/* lookup entry in hashmap */
+			entry = hashmap_get(&map, &key, p1);
+
+			/* print result */
+			if (!entry)
+				puts("NULL");
+			while (entry) {
+				puts(get_value(entry));
+				entry = hashmap_get_next(&map, entry);
+			}
+
+		} else if (!strcmp("remove", cmd) && l1) {
+
+			/* setup static key */
+			struct hashmap_entry key;
+			hashmap_entry_init(&key, hash);
+
+			/* remove entry from hashmap */
+			entry = hashmap_remove(&map, &key, p1);
+
+			/* print result and free entry*/
+			puts(entry ? get_value(entry) : "NULL");
+			free(entry);
+
+		} else if (!strcmp("iterate", cmd)) {
+
+			struct hashmap_iter iter;
+			hashmap_iter_init(&map, &iter);
+			while ((entry = hashmap_iter_next(&iter)))
+				printf("%s %s\n", entry->key, get_value(entry));
+
+		} else if (!strcmp("size", cmd)) {
+
+			/* print table sizes */
+			printf("%u %u\n", map.tablesize, map.size);
+
+		} else if (!strcmp("perfhashmap", cmd) && l1 && l2) {
+
+			perf_hashmap(atoi(p1), atoi(p2));
+
+		} else if (!strcmp("perfhash", cmd) && l1 && l2) {
+
+			perf_hash(atoi(p1), atoi(p2));
+
+		} else {
+
+			printf("Unknown command %s\n", cmd);
+
+		}
+	}
+
+	hashmap_free(&map, free);
+	return 0;
+}
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 03/14] builtin/describe.c: use new hash map implementation
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
  2013-11-07 14:33 ` [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it Karsten Blees
  2013-11-07 14:34 ` [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal Karsten Blees
@ 2013-11-07 14:35 ` Karsten Blees
  2013-11-07 14:36 ` [PATCH v4 04/14] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:35 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/describe.c | 53 ++++++++++++++++++++++++-----------------------------
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 6f62109..0b2cef4 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -6,7 +6,7 @@
 #include "exec_cmd.h"
 #include "parse-options.h"
 #include "diff.h"
-#include "hash.h"
+#include "hashmap.h"
 #include "argv-array.h"
 
 #define SEEN		(1u << 0)
@@ -25,7 +25,7 @@ static int longformat;
 static int first_parent;
 static int abbrev = -1; /* unspecified */
 static int max_candidates = 10;
-static struct hash_table names;
+static struct hashmap names;
 static int have_util;
 static const char *pattern;
 static int always;
@@ -37,7 +37,7 @@ static const char *diff_index_args[] = {
 };
 
 struct commit_name {
-	struct commit_name *next;
+	struct hashmap_entry entry;
 	unsigned char peeled[20];
 	struct tag *tag;
 	unsigned prio:2; /* annotated tag = 2, tag = 1, head = 0 */
@@ -50,6 +50,12 @@ static const char *prio_names[] = {
 	"head", "lightweight", "annotated",
 };
 
+static int commit_name_cmp(const struct commit_name *cn1,
+		const struct commit_name *cn2, const void *peeled)
+{
+	return hashcmp(cn1->peeled, peeled ? peeled : cn2->peeled);
+}
+
 static inline unsigned int hash_sha1(const unsigned char *sha1)
 {
 	unsigned int hash;
@@ -59,21 +65,9 @@ static inline unsigned int hash_sha1(const unsigned char *sha1)
 
 static inline struct commit_name *find_commit_name(const unsigned char *peeled)
 {
-	struct commit_name *n = lookup_hash(hash_sha1(peeled), &names);
-	while (n && !!hashcmp(peeled, n->peeled))
-		n = n->next;
-	return n;
-}
-
-static int set_util(void *chain, void *data)
-{
-	struct commit_name *n;
-	for (n = chain; n; n = n->next) {
-		struct commit *c = lookup_commit_reference_gently(n->peeled, 1);
-		if (c)
-			c->util = n;
-	}
-	return 0;
+	struct commit_name key;
+	hashmap_entry_init(&key, hash_sha1(peeled));
+	return hashmap_get(&names, &key, peeled);
 }
 
 static int replace_name(struct commit_name *e,
@@ -118,16 +112,10 @@ static void add_to_known_names(const char *path,
 	struct tag *tag = NULL;
 	if (replace_name(e, prio, sha1, &tag)) {
 		if (!e) {
-			void **pos;
 			e = xmalloc(sizeof(struct commit_name));
 			hashcpy(e->peeled, peeled);
-			pos = insert_hash(hash_sha1(peeled), e, &names);
-			if (pos) {
-				e->next = *pos;
-				*pos = e;
-			} else {
-				e->next = NULL;
-			}
+			hashmap_entry_init(e, hash_sha1(peeled));
+			hashmap_add(&names, e);
 			e->path = NULL;
 		}
 		e->tag = tag;
@@ -292,7 +280,14 @@ static void describe(const char *arg, int last_one)
 		fprintf(stderr, _("searching to describe %s\n"), arg);
 
 	if (!have_util) {
-		for_each_hash(&names, set_util, NULL);
+		struct hashmap_iter iter;
+		struct commit *c;
+		struct commit_name *n = hashmap_iter_first(&names, &iter);
+		for (; n; n = hashmap_iter_next(&iter)) {
+			c = lookup_commit_reference_gently(n->peeled, 1);
+			if (c)
+				c->util = n;
+		}
 		have_util = 1;
 	}
 
@@ -463,9 +458,9 @@ int cmd_describe(int argc, const char **argv, const char *prefix)
 		return cmd_name_rev(args.argc, args.argv, prefix);
 	}
 
-	init_hash(&names);
+	hashmap_init(&names, (hashmap_cmp_fn) commit_name_cmp, 0);
 	for_each_rawref(get_name, NULL);
-	if (!names.nr && !always)
+	if (!names.size && !always)
 		die(_("No names found, cannot describe anything."));
 
 	if (argc == 0) {
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 04/14] diffcore-rename.c: move code around to prepare for the next patch
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (2 preceding siblings ...)
  2013-11-07 14:35 ` [PATCH v4 03/14] builtin/describe.c: use new hash map implementation Karsten Blees
@ 2013-11-07 14:36 ` Karsten Blees
  2013-11-07 14:36 ` [PATCH v4 05/14] diffcore-rename.c: simplify finding exact renames Karsten Blees
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:36 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

No actual code changes, just move hash_filespec up and outdent part of
find_identical_files.

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diffcore-rename.c | 98 +++++++++++++++++++++++++++----------------------------
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6c7a72f..008a60c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -248,6 +248,18 @@ struct file_similarity {
 	struct file_similarity *next;
 };
 
+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+	unsigned int hash;
+	if (!filespec->sha1_valid) {
+		if (diff_populate_filespec(filespec, 0))
+			return 0;
+		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
+	}
+	memcpy(&hash, filespec->sha1, sizeof(hash));
+	return hash;
+}
+
 static int find_identical_files(struct file_similarity *src,
 				struct file_similarity *dst,
 				struct diff_options *options)
@@ -258,46 +270,46 @@ static int find_identical_files(struct file_similarity *src,
 	 * Walk over all the destinations ...
 	 */
 	do {
-		struct diff_filespec *target = dst->filespec;
-		struct file_similarity *p, *best;
-		int i = 100, best_score = -1;
-
-		/*
-		 * .. to find the best source match
-		 */
-		best = NULL;
-		for (p = src; p; p = p->next) {
-			int score;
-			struct diff_filespec *source = p->filespec;
-
-			/* False hash collision? */
-			if (hashcmp(source->sha1, target->sha1))
-				continue;
-			/* Non-regular files? If so, the modes must match! */
-			if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
-				if (source->mode != target->mode)
-					continue;
-			}
-			/* Give higher scores to sources that haven't been used already */
-			score = !source->rename_used;
-			if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
-				continue;
-			score += basename_same(source, target);
-			if (score > best_score) {
-				best = p;
-				best_score = score;
-				if (score == 2)
-					break;
-			}
+	struct diff_filespec *target = dst->filespec;
+	struct file_similarity *p, *best;
+	int i = 100, best_score = -1;
 
-			/* Too many identical alternatives? Pick one */
-			if (!--i)
-				break;
+	/*
+	 * .. to find the best source match
+	 */
+	best = NULL;
+	for (p = src; p; p = p->next) {
+		int score;
+		struct diff_filespec *source = p->filespec;
+
+		/* False hash collision? */
+		if (hashcmp(source->sha1, target->sha1))
+			continue;
+		/* Non-regular files? If so, the modes must match! */
+		if (!S_ISREG(source->mode) || !S_ISREG(target->mode)) {
+			if (source->mode != target->mode)
+				continue;
 		}
-		if (best) {
-			record_rename_pair(dst->index, best->index, MAX_SCORE);
-			renames++;
+		/* Give higher scores to sources that haven't been used already */
+		score = !source->rename_used;
+		if (source->rename_used && options->detect_rename != DIFF_DETECT_COPY)
+			continue;
+		score += basename_same(source, target);
+		if (score > best_score) {
+			best = p;
+			best_score = score;
+			if (score == 2)
+				break;
 		}
+
+		/* Too many identical alternatives? Pick one */
+		if (!--i)
+			break;
+	}
+	if (best) {
+		record_rename_pair(dst->index, best->index, MAX_SCORE);
+		renames++;
+	}
 	} while ((dst = dst->next) != NULL);
 	return renames;
 }
@@ -343,18 +355,6 @@ static int find_same_files(void *ptr, void *data)
 	return ret;
 }
 
-static unsigned int hash_filespec(struct diff_filespec *filespec)
-{
-	unsigned int hash;
-	if (!filespec->sha1_valid) {
-		if (diff_populate_filespec(filespec, 0))
-			return 0;
-		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
-	}
-	memcpy(&hash, filespec->sha1, sizeof(hash));
-	return hash;
-}
-
 static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
 {
 	void **pos;
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 05/14] diffcore-rename.c: simplify finding exact renames
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (3 preceding siblings ...)
  2013-11-07 14:36 ` [PATCH v4 04/14] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
@ 2013-11-07 14:36 ` Karsten Blees
  2013-11-07 14:37 ` [PATCH v4 06/14] diffcore-rename.c: use new hash map implementation Karsten Blees
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:36 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

The find_exact_renames function currently only uses the hash table for
grouping, i.e.:

1. add sources
2. add destinations
3. iterate all buckets, per bucket:
4. split sources from destinations
5. iterate destinations, per destination:
6. iterate sources to find best match

This can be simplified by utilizing the lookup functionality of the hash
table, i.e.:

1. add sources
2. iterate destinations, per destination:
3. lookup sources matching the current destination
4. iterate sources to find best match

This saves several iterations and file_similarity allocations for the
destinations.

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diffcore-rename.c | 75 +++++++++++++++----------------------------------------
 1 file changed, 20 insertions(+), 55 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 008a60c..cfeb408 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -243,7 +243,7 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
-	int src_dst, index;
+	int index;
 	struct diff_filespec *filespec;
 	struct file_similarity *next;
 };
@@ -260,25 +260,21 @@ static unsigned int hash_filespec(struct diff_filespec *filespec)
 	return hash;
 }
 
-static int find_identical_files(struct file_similarity *src,
-				struct file_similarity *dst,
+static int find_identical_files(struct hash_table *srcs,
+				int dst_index,
 				struct diff_options *options)
 {
 	int renames = 0;
 
-	/*
-	 * Walk over all the destinations ...
-	 */
-	do {
-	struct diff_filespec *target = dst->filespec;
+	struct diff_filespec *target = rename_dst[dst_index].two;
 	struct file_similarity *p, *best;
 	int i = 100, best_score = -1;
 
 	/*
-	 * .. to find the best source match
+	 * Find the best source match for specified destination.
 	 */
 	best = NULL;
-	for (p = src; p; p = p->next) {
+	for (p = lookup_hash(hash_filespec(target), srcs); p; p = p->next) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
@@ -307,61 +303,28 @@ static int find_identical_files(struct file_similarity *src,
 			break;
 	}
 	if (best) {
-		record_rename_pair(dst->index, best->index, MAX_SCORE);
+		record_rename_pair(dst_index, best->index, MAX_SCORE);
 		renames++;
 	}
-	} while ((dst = dst->next) != NULL);
 	return renames;
 }
 
-static void free_similarity_list(struct file_similarity *p)
+static int free_similarity_list(void *p, void *unused)
 {
 	while (p) {
 		struct file_similarity *entry = p;
-		p = p->next;
+		p = entry->next;
 		free(entry);
 	}
+	return 0;
 }
 
-static int find_same_files(void *ptr, void *data)
-{
-	int ret;
-	struct file_similarity *p = ptr;
-	struct file_similarity *src = NULL, *dst = NULL;
-	struct diff_options *options = data;
-
-	/* Split the hash list up into sources and destinations */
-	do {
-		struct file_similarity *entry = p;
-		p = p->next;
-		if (entry->src_dst < 0) {
-			entry->next = src;
-			src = entry;
-		} else {
-			entry->next = dst;
-			dst = entry;
-		}
-	} while (p);
-
-	/*
-	 * If we have both sources *and* destinations, see if
-	 * we can match them up
-	 */
-	ret = (src && dst) ? find_identical_files(src, dst, options) : 0;
-
-	/* Free the hashes and return the number of renames found */
-	free_similarity_list(src);
-	free_similarity_list(dst);
-	return ret;
-}
-
-static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
+static void insert_file_table(struct hash_table *table, int index, struct diff_filespec *filespec)
 {
 	void **pos;
 	unsigned int hash;
 	struct file_similarity *entry = xmalloc(sizeof(*entry));
 
-	entry->src_dst = src_dst;
 	entry->index = index;
 	entry->filespec = filespec;
 	entry->next = NULL;
@@ -385,24 +348,26 @@ static void insert_file_table(struct hash_table *table, int src_dst, int index,
  */
 static int find_exact_renames(struct diff_options *options)
 {
-	int i;
+	int i, renames = 0;
 	struct hash_table file_table;
 
+	/* Add all sources to the hash table */
 	init_hash(&file_table);
-	preallocate_hash(&file_table, rename_src_nr + rename_dst_nr);
+	preallocate_hash(&file_table, rename_src_nr);
 	for (i = 0; i < rename_src_nr; i++)
-		insert_file_table(&file_table, -1, i, rename_src[i].p->one);
+		insert_file_table(&file_table, i, rename_src[i].p->one);
 
+	/* Walk the destinations and find best source match */
 	for (i = 0; i < rename_dst_nr; i++)
-		insert_file_table(&file_table, 1, i, rename_dst[i].two);
+		renames += find_identical_files(&file_table, i, options);
 
-	/* Find the renames */
-	i = for_each_hash(&file_table, find_same_files, options);
+	/* Free source file_similarity chains */
+	for_each_hash(&file_table, free_similarity_list, options);
 
 	/* .. and free the hash data structure */
 	free_hash(&file_table);
 
-	return i;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 06/14] diffcore-rename.c: use new hash map implementation
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (4 preceding siblings ...)
  2013-11-07 14:36 ` [PATCH v4 05/14] diffcore-rename.c: simplify finding exact renames Karsten Blees
@ 2013-11-07 14:37 ` Karsten Blees
  2013-11-07 14:38 ` [PATCH v4 07/14] name-hash.c: use new hash map implementation for directories Karsten Blees
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:37 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 diffcore-rename.c | 48 +++++++++++++-----------------------------------
 1 file changed, 13 insertions(+), 35 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index cfeb408..d996c6a 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,7 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
-#include "hash.h"
+#include "hashmap.h"
 #include "progress.h"
 
 /* Table of rename/copy destinations */
@@ -243,9 +243,9 @@ static int score_compare(const void *a_, const void *b_)
 }
 
 struct file_similarity {
+	struct hashmap_entry entry;
 	int index;
 	struct diff_filespec *filespec;
-	struct file_similarity *next;
 };
 
 static unsigned int hash_filespec(struct diff_filespec *filespec)
@@ -260,21 +260,22 @@ static unsigned int hash_filespec(struct diff_filespec *filespec)
 	return hash;
 }
 
-static int find_identical_files(struct hash_table *srcs,
+static int find_identical_files(struct hashmap *srcs,
 				int dst_index,
 				struct diff_options *options)
 {
 	int renames = 0;
 
 	struct diff_filespec *target = rename_dst[dst_index].two;
-	struct file_similarity *p, *best;
+	struct file_similarity *p, *best, dst;
 	int i = 100, best_score = -1;
 
 	/*
 	 * Find the best source match for specified destination.
 	 */
 	best = NULL;
-	for (p = lookup_hash(hash_filespec(target), srcs); p; p = p->next) {
+	hashmap_entry_init(&dst, hash_filespec(target));
+	for (p = hashmap_get(srcs, &dst, NULL); p; p = hashmap_get_next(srcs, p)) {
 		int score;
 		struct diff_filespec *source = p->filespec;
 
@@ -309,34 +310,15 @@ static int find_identical_files(struct hash_table *srcs,
 	return renames;
 }
 
-static int free_similarity_list(void *p, void *unused)
+static void insert_file_table(struct hashmap *table, int index, struct diff_filespec *filespec)
 {
-	while (p) {
-		struct file_similarity *entry = p;
-		p = entry->next;
-		free(entry);
-	}
-	return 0;
-}
-
-static void insert_file_table(struct hash_table *table, int index, struct diff_filespec *filespec)
-{
-	void **pos;
-	unsigned int hash;
 	struct file_similarity *entry = xmalloc(sizeof(*entry));
 
 	entry->index = index;
 	entry->filespec = filespec;
-	entry->next = NULL;
-
-	hash = hash_filespec(filespec);
-	pos = insert_hash(hash, entry, table);
 
-	/* We already had an entry there? */
-	if (pos) {
-		entry->next = *pos;
-		*pos = entry;
-	}
+	hashmap_entry_init(entry, hash_filespec(filespec));
+	hashmap_add(table, entry);
 }
 
 /*
@@ -349,11 +331,10 @@ static void insert_file_table(struct hash_table *table, int index, struct diff_f
 static int find_exact_renames(struct diff_options *options)
 {
 	int i, renames = 0;
-	struct hash_table file_table;
+	struct hashmap file_table;
 
 	/* Add all sources to the hash table */
-	init_hash(&file_table);
-	preallocate_hash(&file_table, rename_src_nr);
+	hashmap_init(&file_table, NULL, rename_src_nr);
 	for (i = 0; i < rename_src_nr; i++)
 		insert_file_table(&file_table, i, rename_src[i].p->one);
 
@@ -361,11 +342,8 @@ static int find_exact_renames(struct diff_options *options)
 	for (i = 0; i < rename_dst_nr; i++)
 		renames += find_identical_files(&file_table, i, options);
 
-	/* Free source file_similarity chains */
-	for_each_hash(&file_table, free_similarity_list, options);
-
-	/* .. and free the hash data structure */
-	free_hash(&file_table);
+	/* Free the hash data structure and entries */
+	hashmap_free(&file_table, free);
 
 	return renames;
 }
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 07/14] name-hash.c: use new hash map implementation for directories
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (5 preceding siblings ...)
  2013-11-07 14:37 ` [PATCH v4 06/14] diffcore-rename.c: use new hash map implementation Karsten Blees
@ 2013-11-07 14:38 ` Karsten Blees
  2013-11-07 14:38 ` [PATCH v4 08/14] name-hash.c: remove unreferenced directory entries Karsten Blees
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:38 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h     |  3 ++-
 name-hash.c | 77 +++++++++++++++----------------------------------------------
 2 files changed, 20 insertions(+), 60 deletions(-)

diff --git a/cache.h b/cache.h
index d1f3c71..84e9ad6 100644
--- a/cache.h
+++ b/cache.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "strbuf.h"
 #include "hash.h"
+#include "hashmap.h"
 #include "advice.h"
 #include "gettext.h"
 #include "convert.h"
@@ -278,7 +279,7 @@ struct index_state {
 	unsigned name_hash_initialized : 1,
 		 initialized : 1;
 	struct hash_table name_hash;
-	struct hash_table dir_hash;
+	struct hashmap dir_hash;
 };
 
 extern struct index_state the_index;
diff --git a/name-hash.c b/name-hash.c
index e5b6e1a..ae636f8 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -8,49 +8,28 @@
 #define NO_THE_INDEX_COMPATIBILITY_MACROS
 #include "cache.h"
 
-/*
- * This removes bit 5 if bit 6 is set.
- *
- * That will make US-ASCII characters hash to their upper-case
- * equivalent. We could easily do this one whole word at a time,
- * but that's for future worries.
- */
-static inline unsigned char icase_hash(unsigned char c)
-{
-	return c & ~((c & 0x40) >> 1);
-}
-
-static unsigned int hash_name(const char *name, int namelen)
-{
-	unsigned int hash = 0x123;
-
-	while (namelen--) {
-		unsigned char c = *name++;
-		c = icase_hash(c);
-		hash = hash*101 + c;
-	}
-	return hash;
-}
-
 struct dir_entry {
-	struct dir_entry *next;
+	struct hashmap_entry ent;
 	struct dir_entry *parent;
 	struct cache_entry *ce;
 	int nr;
 	unsigned int namelen;
 };
 
+static int dir_entry_cmp(const struct dir_entry *e1,
+		const struct dir_entry *e2, const char *name)
+{
+	return e1->namelen != e2->namelen || strncasecmp(e1->ce->name,
+			name ? name : e2->ce->name, e1->namelen);
+}
+
 static struct dir_entry *find_dir_entry(struct index_state *istate,
 		const char *name, unsigned int namelen)
 {
-	unsigned int hash = hash_name(name, namelen);
-	struct dir_entry *dir;
-
-	for (dir = lookup_hash(hash, &istate->dir_hash); dir; dir = dir->next)
-		if (dir->namelen == namelen &&
-		    !strncasecmp(dir->ce->name, name, namelen))
-			return dir;
-	return NULL;
+	struct dir_entry key;
+	hashmap_entry_init(&key, memihash(name, namelen));
+	key.namelen = namelen;
+	return hashmap_get(&istate->dir_hash, &key, name);
 }
 
 static struct dir_entry *hash_dir_entry(struct index_state *istate,
@@ -84,18 +63,11 @@ static struct dir_entry *hash_dir_entry(struct index_state *istate,
 	dir = find_dir_entry(istate, ce->name, namelen);
 	if (!dir) {
 		/* not found, create it and add to hash table */
-		void **pdir;
-		unsigned int hash = hash_name(ce->name, namelen);
-
 		dir = xcalloc(1, sizeof(struct dir_entry));
+		hashmap_entry_init(dir, memihash(ce->name, namelen));
 		dir->namelen = namelen;
 		dir->ce = ce;
-
-		pdir = insert_hash(hash, dir, &istate->dir_hash);
-		if (pdir) {
-			dir->next = *pdir;
-			*pdir = dir;
-		}
+		hashmap_add(&istate->dir_hash, dir);
 
 		/* recursively add missing parent directories */
 		dir->parent = hash_dir_entry(istate, ce, namelen);
@@ -134,7 +106,7 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 		return;
 	ce->ce_flags |= CE_HASHED;
 	ce->next = NULL;
-	hash = hash_name(ce->name, ce_namelen(ce));
+	hash = memihash(ce->name, ce_namelen(ce));
 	pos = insert_hash(hash, ce, &istate->name_hash);
 	if (pos) {
 		ce->next = *pos;
@@ -153,6 +125,7 @@ static void lazy_init_name_hash(struct index_state *istate)
 		return;
 	if (istate->cache_nr)
 		preallocate_hash(&istate->name_hash, istate->cache_nr);
+	hashmap_init(&istate->dir_hash, (hashmap_cmp_fn) dir_entry_cmp, 0);
 	for (nr = 0; nr < istate->cache_nr; nr++)
 		hash_index_entry(istate, istate->cache[nr]);
 	istate->name_hash_initialized = 1;
@@ -247,7 +220,7 @@ struct cache_entry *index_dir_exists(struct index_state *istate, const char *nam
 
 struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase)
 {
-	unsigned int hash = hash_name(name, namelen);
+	unsigned int hash = memihash(name, namelen);
 	struct cache_entry *ce;
 
 	lazy_init_name_hash(istate);
@@ -270,26 +243,12 @@ struct cache_entry *index_name_exists(struct index_state *istate, const char *na
 	return index_file_exists(istate, name, namelen, icase);
 }
 
-static int free_dir_entry(void *entry, void *unused)
-{
-	struct dir_entry *dir = entry;
-	while (dir) {
-		struct dir_entry *next = dir->next;
-		free(dir);
-		dir = next;
-	}
-	return 0;
-}
-
 void free_name_hash(struct index_state *istate)
 {
 	if (!istate->name_hash_initialized)
 		return;
 	istate->name_hash_initialized = 0;
-	if (ignore_case)
-		/* free directory entries */
-		for_each_hash(&istate->dir_hash, free_dir_entry, NULL);
 
 	free_hash(&istate->name_hash);
-	free_hash(&istate->dir_hash);
+	hashmap_free(&istate->dir_hash, free);
 }
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 08/14] name-hash.c: remove unreferenced directory entries
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (6 preceding siblings ...)
  2013-11-07 14:38 ` [PATCH v4 07/14] name-hash.c: use new hash map implementation for directories Karsten Blees
@ 2013-11-07 14:38 ` Karsten Blees
  2013-11-07 14:39 ` [PATCH v4 09/14] name-hash.c: use new hash map implementation for cache entries Karsten Blees
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:38 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

The new hashmap implementation supports remove, so remove and free
directory entries that are no longer referenced by active cache entries.
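
Concretely, remove_dir_entry now drops entries whose reference count
reaches zero and continues with the parent directory (condensed from the
hunk below):

	struct dir_entry *dir = hash_dir_entry(istate, ce, ce_namelen(ce));
	while (dir && !(--dir->nr)) {
		struct dir_entry *parent = dir->parent;
		hashmap_remove(&istate->dir_hash, dir, NULL);
		free(dir);
		dir = parent;
	}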

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 name-hash.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/name-hash.c b/name-hash.c
index ae636f8..116f56d 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -86,15 +86,16 @@ static void add_dir_entry(struct index_state *istate, struct cache_entry *ce)
 static void remove_dir_entry(struct index_state *istate, struct cache_entry *ce)
 {
 	/*
-	 * Release reference to the directory entry (and parents if 0).
-	 *
-	 * Note: we do not remove / free the entry because there's no
-	 * hash.[ch]::remove_hash and dir->next may point to other entries
-	 * that are still valid, so we must not free the memory.
+	 * Release reference to the directory entry. If 0, remove and continue
+	 * with parent directory.
 	 */
 	struct dir_entry *dir = hash_dir_entry(istate, ce, ce_namelen(ce));
-	while (dir && dir->nr && !(--dir->nr))
-		dir = dir->parent;
+	while (dir && !(--dir->nr)) {
+		struct dir_entry *parent = dir->parent;
+		hashmap_remove(&istate->dir_hash, dir, NULL);
+		free(dir);
+		dir = parent;
+	}
 }
 
 static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 09/14] name-hash.c: use new hash map implementation for cache entries
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (7 preceding siblings ...)
  2013-11-07 14:38 ` [PATCH v4 08/14] name-hash.c: remove unreferenced directory entries Karsten Blees
@ 2013-11-07 14:39 ` Karsten Blees
  2013-11-07 14:39 ` [PATCH v4 10/14] name-hash.c: remove cache entries instead of marking them CE_UNHASHED Karsten Blees
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:39 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

Note: the "ce->next = NULL;" in unpack-trees.c::do_add_entry can safely be
removed, as ce->next (now ce->ent.next) is always properly initialized in
name-hash.c::hash_index_entry.

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h        |  8 +++++---
 name-hash.c    | 24 ++++++++----------------
 unpack-trees.c |  1 -
 3 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/cache.h b/cache.h
index 84e9ad6..cedc7ae 100644
--- a/cache.h
+++ b/cache.h
@@ -131,12 +131,12 @@ struct stat_data {
 };
 
 struct cache_entry {
+	struct hashmap_entry ent;
 	struct stat_data ce_stat_data;
 	unsigned int ce_mode;
 	unsigned int ce_flags;
 	unsigned int ce_namelen;
 	unsigned char sha1[20];
-	struct cache_entry *next;
 	char name[FLEX_ARRAY]; /* more */
 };
 
@@ -203,7 +203,9 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 	unsigned int state = dst->ce_flags & CE_STATE_MASK;
 
 	/* Don't copy hash chain and name */
-	memcpy(dst, src, offsetof(struct cache_entry, next));
+	memcpy(&dst->ce_stat_data, &src->ce_stat_data,
+			offsetof(struct cache_entry, name) -
+			offsetof(struct cache_entry, ce_stat_data));
 
 	/* Restore the hash state */
 	dst->ce_flags = (dst->ce_flags & ~CE_STATE_MASK) | state;
@@ -278,7 +280,7 @@ struct index_state {
 	struct cache_time timestamp;
 	unsigned name_hash_initialized : 1,
 		 initialized : 1;
-	struct hash_table name_hash;
+	struct hashmap name_hash;
 	struct hashmap dir_hash;
 };
 
diff --git a/name-hash.c b/name-hash.c
index 116f56d..1ec82a3 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -100,19 +100,11 @@ static void remove_dir_entry(struct index_state *istate, struct cache_entry *ce)
 
 static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 {
-	void **pos;
-	unsigned int hash;
-
 	if (ce->ce_flags & CE_HASHED)
 		return;
 	ce->ce_flags |= CE_HASHED;
-	ce->next = NULL;
-	hash = memihash(ce->name, ce_namelen(ce));
-	pos = insert_hash(hash, ce, &istate->name_hash);
-	if (pos) {
-		ce->next = *pos;
-		*pos = ce;
-	}
+	hashmap_entry_init(ce, memihash(ce->name, ce_namelen(ce)));
+	hashmap_add(&istate->name_hash, ce);
 
 	if (ignore_case && !(ce->ce_flags & CE_UNHASHED))
 		add_dir_entry(istate, ce);
@@ -124,8 +116,7 @@ static void lazy_init_name_hash(struct index_state *istate)
 
 	if (istate->name_hash_initialized)
 		return;
-	if (istate->cache_nr)
-		preallocate_hash(&istate->name_hash, istate->cache_nr);
+	hashmap_init(&istate->name_hash, NULL, istate->cache_nr);
 	hashmap_init(&istate->dir_hash, (hashmap_cmp_fn) dir_entry_cmp, 0);
 	for (nr = 0; nr < istate->cache_nr; nr++)
 		hash_index_entry(istate, istate->cache[nr]);
@@ -221,18 +212,19 @@ struct cache_entry *index_dir_exists(struct index_state *istate, const char *nam
 
 struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int icase)
 {
-	unsigned int hash = memihash(name, namelen);
 	struct cache_entry *ce;
+	struct hashmap_entry key;
 
 	lazy_init_name_hash(istate);
-	ce = lookup_hash(hash, &istate->name_hash);
 
+	hashmap_entry_init(&key, memihash(name, namelen));
+	ce = hashmap_get(&istate->name_hash, &key, NULL);
 	while (ce) {
 		if (!(ce->ce_flags & CE_UNHASHED)) {
 			if (same_name(ce, name, namelen, icase))
 				return ce;
 		}
-		ce = ce->next;
+		ce = hashmap_get_next(&istate->name_hash, ce);
 	}
 	return NULL;
 }
@@ -250,6 +242,6 @@ void free_name_hash(struct index_state *istate)
 		return;
 	istate->name_hash_initialized = 0;
 
-	free_hash(&istate->name_hash);
+	hashmap_free(&istate->name_hash, NULL);
 	hashmap_free(&istate->dir_hash, free);
 }
diff --git a/unpack-trees.c b/unpack-trees.c
index ad3e9a0..f8985d4 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -110,7 +110,6 @@ static void do_add_entry(struct unpack_trees_options *o, struct cache_entry *ce,
 	if (set & CE_REMOVE)
 		set |= CE_WT_REMOVE;
 
-	ce->next = NULL;
 	ce->ce_flags = (ce->ce_flags & ~clear) | set;
 	add_index_entry(&o->result, ce,
 			ADD_CACHE_OK_TO_ADD | ADD_CACHE_OK_TO_REPLACE);
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 10/14] name-hash.c: remove cache entries instead of marking them CE_UNHASHED
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (8 preceding siblings ...)
  2013-11-07 14:39 ` [PATCH v4 09/14] name-hash.c: use new hash map implementation for cache entries Karsten Blees
@ 2013-11-07 14:39 ` Karsten Blees
  2013-11-07 14:40 ` [PATCH v4 11/14] remove old hash.[ch] implementation Karsten Blees
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:39 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

The new hashmap implementation supports remove, so really remove unused
cache entries from the name hashmap instead of just marking them.

The CE_UNHASHED flag and CE_STATE_MASK are no longer needed.

Keep the CE_HASHED flag to prevent adding entries twice.
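
The comparison function for the name hash now serves two purposes: for
removal, the entry itself is passed as keydata so that only the exact entry
(pointer equality) is removed; lookups pass NULL and match by hash code
only, leaving the final decision to same_name (condensed from the hunks
below):

	static int cache_entry_cmp(const struct cache_entry *ce1,
			const struct cache_entry *ce2, const void *remove)
	{
		/* pointer equality for remove_name_hash, hash only for lookups */
		return remove ? !(ce1 == ce2) : 0;
	}

	/* in remove_name_hash() */
	hashmap_remove(&istate->name_hash, ce, ce);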

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h        |  6 ++----
 name-hash.c    | 46 ++++++++++++++++++++++------------------------
 read-cache.c   |  2 +-
 unpack-trees.c |  2 +-
 4 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/cache.h b/cache.h
index cedc7ae..05b8011 100644
--- a/cache.h
+++ b/cache.h
@@ -160,7 +160,6 @@ struct cache_entry {
 #define CE_ADDED             (1 << 19)
 
 #define CE_HASHED            (1 << 20)
-#define CE_UNHASHED          (1 << 21)
 #define CE_WT_REMOVE         (1 << 22) /* remove in work directory */
 #define CE_CONFLICTED        (1 << 23)
 
@@ -196,11 +195,10 @@ struct pathspec;
  * Copy the sha1 and stat state of a cache entry from one to
  * another. But we never change the name, or the hash state!
  */
-#define CE_STATE_MASK (CE_HASHED | CE_UNHASHED)
 static inline void copy_cache_entry(struct cache_entry *dst,
 				    const struct cache_entry *src)
 {
-	unsigned int state = dst->ce_flags & CE_STATE_MASK;
+	unsigned int state = dst->ce_flags & CE_HASHED;
 
 	/* Don't copy hash chain and name */
 	memcpy(&dst->ce_stat_data, &src->ce_stat_data,
@@ -208,7 +206,7 @@ static inline void copy_cache_entry(struct cache_entry *dst,
 			offsetof(struct cache_entry, ce_stat_data));
 
 	/* Restore the hash state */
-	dst->ce_flags = (dst->ce_flags & ~CE_STATE_MASK) | state;
+	dst->ce_flags = (dst->ce_flags & ~CE_HASHED) | state;
 }
 
 static inline unsigned create_ce_flags(unsigned stage)
diff --git a/name-hash.c b/name-hash.c
index 1ec82a3..8871d8e 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -106,17 +106,29 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 	hashmap_entry_init(ce, memihash(ce->name, ce_namelen(ce)));
 	hashmap_add(&istate->name_hash, ce);
 
-	if (ignore_case && !(ce->ce_flags & CE_UNHASHED))
+	if (ignore_case)
 		add_dir_entry(istate, ce);
 }
 
+static int cache_entry_cmp(const struct cache_entry *ce1,
+		const struct cache_entry *ce2, const void *remove)
+{
+	/*
+	 * For remove_name_hash, find the exact entry (pointer equality); for
+	 * index_name_exists, find all entries with matching hash code and
+	 * decide whether the entry matches in same_name.
+	 */
+	return remove ? !(ce1 == ce2) : 0;
+}
+
 static void lazy_init_name_hash(struct index_state *istate)
 {
 	int nr;
 
 	if (istate->name_hash_initialized)
 		return;
-	hashmap_init(&istate->name_hash, NULL, istate->cache_nr);
+	hashmap_init(&istate->name_hash, (hashmap_cmp_fn) cache_entry_cmp,
+			istate->cache_nr);
 	hashmap_init(&istate->dir_hash, (hashmap_cmp_fn) dir_entry_cmp, 0);
 	for (nr = 0; nr < istate->cache_nr; nr++)
 		hash_index_entry(istate, istate->cache[nr]);
@@ -125,31 +137,19 @@ static void lazy_init_name_hash(struct index_state *istate)
 
 void add_name_hash(struct index_state *istate, struct cache_entry *ce)
 {
-	/* if already hashed, add reference to directory entries */
-	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_STATE_MASK)
-		add_dir_entry(istate, ce);
-
-	ce->ce_flags &= ~CE_UNHASHED;
 	if (istate->name_hash_initialized)
 		hash_index_entry(istate, ce);
 }
 
-/*
- * We don't actually *remove* it, we can just mark it invalid so that
- * we won't find it in lookups.
- *
- * Not only would we have to search the lists (simple enough), but
- * we'd also have to rehash other hash buckets in case this makes the
- * hash bucket empty (common). So it's much better to just mark
- * it.
- */
 void remove_name_hash(struct index_state *istate, struct cache_entry *ce)
 {
-	/* if already hashed, release reference to directory entries */
-	if (ignore_case && (ce->ce_flags & CE_STATE_MASK) == CE_HASHED)
-		remove_dir_entry(istate, ce);
+	if (!istate->name_hash_initialized || !(ce->ce_flags & CE_HASHED))
+		return;
+	ce->ce_flags &= ~CE_HASHED;
+	hashmap_remove(&istate->name_hash, ce, ce);
 
-	ce->ce_flags |= CE_UNHASHED;
+	if (ignore_case)
+		remove_dir_entry(istate, ce);
 }
 
 static int slow_same_name(const char *name1, int len1, const char *name2, int len2)
@@ -220,10 +220,8 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na
 	hashmap_entry_init(&key, memihash(name, namelen));
 	ce = hashmap_get(&istate->name_hash, &key, NULL);
 	while (ce) {
-		if (!(ce->ce_flags & CE_UNHASHED)) {
-			if (same_name(ce, name, namelen, icase))
-				return ce;
-		}
+		if (same_name(ce, name, namelen, icase))
+			return ce;
 		ce = hashmap_get_next(&istate->name_hash, ce);
 	}
 	return NULL;
diff --git a/read-cache.c b/read-cache.c
index 33dd676..00af9ad 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -58,7 +58,7 @@ void rename_index_entry_at(struct index_state *istate, int nr, const char *new_n
 
 	new = xmalloc(cache_entry_size(namelen));
 	copy_cache_entry(new, old);
-	new->ce_flags &= ~CE_STATE_MASK;
+	new->ce_flags &= ~CE_HASHED;
 	new->ce_namelen = namelen;
 	memcpy(new->name, new_name, namelen + 1);
 
diff --git a/unpack-trees.c b/unpack-trees.c
index f8985d4..82b652f 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -105,7 +105,7 @@ void setup_unpack_trees_porcelain(struct unpack_trees_options *opts,
 static void do_add_entry(struct unpack_trees_options *o, struct cache_entry *ce,
 			 unsigned int set, unsigned int clear)
 {
-	clear |= CE_HASHED | CE_UNHASHED;
+	clear |= CE_HASHED;
 
 	if (set & CE_REMOVE)
 		set |= CE_WT_REMOVE;
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 11/14] remove old hash.[ch] implementation
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (9 preceding siblings ...)
  2013-11-07 14:39 ` [PATCH v4 10/14] name-hash.c: remove cache entries instead of marking them CE_UNHASHED Karsten Blees
@ 2013-11-07 14:40 ` Karsten Blees
  2013-11-07 14:43 ` [PATCH v4 12/14] fix 'git update-index --verbose --again' output Karsten Blees
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:40 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/api-hash.txt |  52 -----------------
 Makefile                             |   2 -
 cache.h                              |   1 -
 hash.c                               | 110 -----------------------------------
 hash.h                               |  50 ----------------
 test-hashmap.c                       |  84 --------------------------
 6 files changed, 299 deletions(-)
 delete mode 100644 Documentation/technical/api-hash.txt
 delete mode 100644 hash.c
 delete mode 100644 hash.h

diff --git a/Documentation/technical/api-hash.txt b/Documentation/technical/api-hash.txt
deleted file mode 100644
index e5061e0..0000000
--- a/Documentation/technical/api-hash.txt
+++ /dev/null
@@ -1,52 +0,0 @@
-hash API
-========
-
-The hash API is a collection of simple hash table functions. Users are expected
-to implement their own hashing.
-
-Data Structures
----------------
-
-`struct hash_table`::
-
-	The hash table structure. The `array` member points to the hash table
-	entries. The `size` member counts the total number of valid and invalid
-	entries in the table. The `nr` member keeps track of the number of
-	valid entries.
-
-`struct hash_table_entry`::
-
-	An opaque structure representing an entry in the hash table. The `hash`
-	member is the entry's hash key and the `ptr` member is the entry's
-	value.
-
-Functions
----------
-
-`init_hash`::
-
-	Initialize the hash table.
-
-`free_hash`::
-
-	Release memory associated with the hash table.
-
-`insert_hash`::
-
-	Insert a pointer into the hash table. If an entry with that hash
-	already exists, a pointer to the existing entry's value is returned.
-	Otherwise NULL is returned.  This allows callers to implement
-	chaining, etc.
-
-`lookup_hash`::
-
-	Lookup an entry in the hash table. If an entry with that hash exists
-	the entry's value is returned. Otherwise NULL is returned.
-
-`for_each_hash`::
-
-	Call a function for each entry in the hash table. The function is
-	expected to take the entry's value as its only argument and return an
-	int. If the function returns a negative int the loop is aborted
-	immediately.  Otherwise, the return value is accumulated and the sum
-	returned upon completion of the loop.
diff --git a/Makefile b/Makefile
index 05c5b4d..5101426 100644
--- a/Makefile
+++ b/Makefile
@@ -676,7 +676,6 @@ LIB_H += git-compat-util.h
 LIB_H += gpg-interface.h
 LIB_H += graph.h
 LIB_H += grep.h
-LIB_H += hash.h
 LIB_H += hashmap.h
 LIB_H += help.h
 LIB_H += http.h
@@ -808,7 +807,6 @@ LIB_OBJS += gettext.o
 LIB_OBJS += gpg-interface.o
 LIB_OBJS += graph.o
 LIB_OBJS += grep.o
-LIB_OBJS += hash.o
 LIB_OBJS += hashmap.o
 LIB_OBJS += help.o
 LIB_OBJS += hex.o
diff --git a/cache.h b/cache.h
index 05b8011..27067b8 100644
--- a/cache.h
+++ b/cache.h
@@ -3,7 +3,6 @@
 
 #include "git-compat-util.h"
 #include "strbuf.h"
-#include "hash.h"
 #include "hashmap.h"
 #include "advice.h"
 #include "gettext.h"
diff --git a/hash.c b/hash.c
deleted file mode 100644
index 749ecfe..0000000
--- a/hash.c
+++ /dev/null
@@ -1,110 +0,0 @@
-/*
- * Some generic hashing helpers.
- */
-#include "cache.h"
-#include "hash.h"
-
-/*
- * Look up a hash entry in the hash table. Return the pointer to
- * the existing entry, or the empty slot if none existed. The caller
- * can then look at the (*ptr) to see whether it existed or not.
- */
-static struct hash_table_entry *lookup_hash_entry(unsigned int hash, const struct hash_table *table)
-{
-	unsigned int size = table->size, nr = hash % size;
-	struct hash_table_entry *array = table->array;
-
-	while (array[nr].ptr) {
-		if (array[nr].hash == hash)
-			break;
-		nr++;
-		if (nr >= size)
-			nr = 0;
-	}
-	return array + nr;
-}
-
-
-/*
- * Insert a new hash entry pointer into the table.
- *
- * If that hash entry already existed, return the pointer to
- * the existing entry (and the caller can create a list of the
- * pointers or do anything else). If it didn't exist, return
- * NULL (and the caller knows the pointer has been inserted).
- */
-static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
-{
-	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
-
-	if (!entry->ptr) {
-		entry->ptr = ptr;
-		entry->hash = hash;
-		table->nr++;
-		return NULL;
-	}
-	return &entry->ptr;
-}
-
-static void grow_hash_table(struct hash_table *table)
-{
-	unsigned int i;
-	unsigned int old_size = table->size, new_size;
-	struct hash_table_entry *old_array = table->array, *new_array;
-
-	new_size = alloc_nr(old_size);
-	new_array = xcalloc(sizeof(struct hash_table_entry), new_size);
-	table->size = new_size;
-	table->array = new_array;
-	table->nr = 0;
-	for (i = 0; i < old_size; i++) {
-		unsigned int hash = old_array[i].hash;
-		void *ptr = old_array[i].ptr;
-		if (ptr)
-			insert_hash_entry(hash, ptr, table);
-	}
-	free(old_array);
-}
-
-void *lookup_hash(unsigned int hash, const struct hash_table *table)
-{
-	if (!table->array)
-		return NULL;
-	return lookup_hash_entry(hash, table)->ptr;
-}
-
-void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
-{
-	unsigned int nr = table->nr;
-	if (nr >= table->size/2)
-		grow_hash_table(table);
-	return insert_hash_entry(hash, ptr, table);
-}
-
-int for_each_hash(const struct hash_table *table, int (*fn)(void *, void *), void *data)
-{
-	int sum = 0;
-	unsigned int i;
-	unsigned int size = table->size;
-	struct hash_table_entry *array = table->array;
-
-	for (i = 0; i < size; i++) {
-		void *ptr = array->ptr;
-		array++;
-		if (ptr) {
-			int val = fn(ptr, data);
-			if (val < 0)
-				return val;
-			sum += val;
-		}
-	}
-	return sum;
-}
-
-void free_hash(struct hash_table *table)
-{
-	free(table->array);
-	table->array = NULL;
-	table->size = 0;
-	table->nr = 0;
-}
diff --git a/hash.h b/hash.h
deleted file mode 100644
index 1d43ac0..0000000
--- a/hash.h
+++ /dev/null
@@ -1,50 +0,0 @@
-#ifndef HASH_H
-#define HASH_H
-
-/*
- * These are some simple generic hash table helper functions.
- * Not necessarily suitable for all users, but good for things
- * where you want to just keep track of a list of things, and
- * have a good hash to use on them.
- *
- * It keeps the hash table at roughly 50-75% free, so the memory
- * cost of the hash table itself is roughly
- *
- *	3 * 2*sizeof(void *) * nr_of_objects
- *
- * bytes.
- *
- * FIXME: on 64-bit architectures, we waste memory. It would be
- * good to have just 32-bit pointers, requiring a special allocator
- * for hashed entries or something.
- */
-struct hash_table_entry {
-	unsigned int hash;
-	void *ptr;
-};
-
-struct hash_table {
-	unsigned int size, nr;
-	struct hash_table_entry *array;
-};
-
-extern void *lookup_hash(unsigned int hash, const struct hash_table *table);
-extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
-extern int for_each_hash(const struct hash_table *table, int (*fn)(void *, void *), void *data);
-extern void free_hash(struct hash_table *table);
-
-static inline void init_hash(struct hash_table *table)
-{
-	table->size = 0;
-	table->nr = 0;
-	table->array = NULL;
-}
-
-static inline void preallocate_hash(struct hash_table *table, unsigned int elts)
-{
-	assert(table->size == 0 && table->nr == 0 && table->array == NULL);
-	table->size = elts * 2;
-	table->array = xcalloc(sizeof(struct hash_table_entry), table->size);
-}
-
-#endif
diff --git a/test-hashmap.c b/test-hashmap.c
index 2d2b834..333b7b3 100644
--- a/test-hashmap.c
+++ b/test-hashmap.c
@@ -126,85 +126,6 @@ static void perf_hashmap(unsigned int method, unsigned int rounds)
 	}
 }
 
-struct hash_entry
-{
-	struct hash_entry *next;
-	char key[FLEX_ARRAY];
-};
-
-/*
- * Test insert performance of hash.[ch]
- * Usage: time echo "perfhash method rounds" | test-hashmap
- */
-static void perf_hash(unsigned int method, unsigned int rounds)
-{
-	struct hash_table map;
-	char buf[16];
-	struct hash_entry **entries, **res, *entry;
-	unsigned int *hashes;
-	unsigned int i, j;
-
-	entries = malloc(TEST_SIZE * sizeof(struct hash_entry*));
-	hashes = malloc(TEST_SIZE * sizeof(int));
-	for (i = 0; i < TEST_SIZE; i++) {
-		snprintf(buf, sizeof(buf), "%i", i);
-		entries[i] = malloc(sizeof(struct hash_entry) + strlen(buf) + 1);
-		strcpy(entries[i]->key, buf);
-		hashes[i] = hash(method, i, entries[i]->key);
-	}
-
-	if (method & TEST_ADD) {
-		/* test adding to the map */
-		for (j = 0; j < rounds; j++) {
-			init_hash(&map);
-
-			/* add entries */
-			for (i = 0; i < TEST_SIZE; i++) {
-				res = (struct hash_entry**) insert_hash(
-						hashes[i], entries[i], &map);
-				if (res) {
-					entries[i]->next = *res;
-					*res = entries[i];
-				} else {
-					entries[i]->next = NULL;
-				}
-			}
-
-			free_hash(&map);
-		}
-	} else {
-		/* test map lookups */
-		init_hash(&map);
-
-		/* fill the map (sparsely if specified) */
-		j = (method & TEST_SPARSE) ? TEST_SIZE / 10 : TEST_SIZE;
-		for (i = 0; i < j; i++) {
-			res = (struct hash_entry**) insert_hash(hashes[i],
-					entries[i], &map);
-			if (res) {
-				entries[i]->next = *res;
-				*res = entries[i];
-			} else {
-				entries[i]->next = NULL;
-			}
-		}
-
-		for (j = 0; j < rounds; j++) {
-			for (i = 0; i < TEST_SIZE; i++) {
-				entry = lookup_hash(hashes[i], &map);
-				while (entry) {
-					if (!strcmp(entries[i]->key, entry->key))
-						break;
-					entry = entry->next;
-				}
-			}
-		}
-
-		free_hash(&map);
-
-	}
-}
-
 #define DELIM " \t\r\n"
 
 /*
@@ -218,7 +139,6 @@ static void perf_hash(unsigned int method, unsigned int rounds)
  * size -> tablesize numentries
  *
  * perfhashmap method rounds -> test hashmap.[ch] performance
- * perfhash method rounds -> test hash.[ch] performance
  */
 int main(int argc, char *argv[])
 {
@@ -324,10 +244,6 @@ int main(int argc, char *argv[])
 
 			perf_hashmap(atoi(p1), atoi(p2));
 
-		} else if (!strcmp("perfhash", cmd) && l1 && l2) {
-
-			perf_hash(atoi(p1), atoi(p2));
-
 		} else {
 
 			printf("Unknown command %s\n", cmd);
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 12/14] fix 'git update-index --verbose --again' output
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (10 preceding siblings ...)
  2013-11-07 14:40 ` [PATCH v4 11/14] remove old hash.[ch] implementation Karsten Blees
@ 2013-11-07 14:43 ` Karsten Blees
  2013-11-07 22:12   ` Junio C Hamano
  2013-11-07 14:44 ` [PATCH v4 13/14] builtin/update-index.c: cleanup update_one Karsten Blees
  2013-11-07 14:45 ` [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries Karsten Blees
  13 siblings, 1 reply; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:43 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

'git update-index --verbose' consistently reports paths relative to the
work-tree root. The only exception is the '--again' option, which reports
paths relative to the current working directory.

Change do_reupdate to use non-prefixed paths.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 builtin/update-index.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index e3a10d7..d180d80 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -579,7 +579,7 @@ static int do_reupdate(int ac, const char **av,
 		 * or worse yet 'allow_replace', active_nr may decrease.
 		 */
 		save_nr = active_nr;
-		update_one(ce->name + prefix_length, prefix, prefix_length);
+		update_one(ce->name, NULL, 0);
 		if (save_nr != active_nr)
 			goto redo;
 	}
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 13/14] builtin/update-index.c: cleanup update_one
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (11 preceding siblings ...)
  2013-11-07 14:43 ` [PATCH v4 12/14] fix 'git update-index --verbose --again' output Karsten Blees
@ 2013-11-07 14:44 ` Karsten Blees
  2013-11-07 21:40   ` Junio C Hamano
  2013-11-07 14:45 ` [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries Karsten Blees
  13 siblings, 1 reply; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:44 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

do_reupdate calls update_one with a cache_entry.name, so there's no need
for the extra sanitation / normalization that happens in prefix_path.
cmd_update_index calls update_one with an already prefixed path, so there
is no need to call prefix_path twice.

Remove the extra prefix_path from update_one. Also remove the now unused
'prefix' and 'prefix_length' parameters.

As of d089eba "setup: sanitize absolute and funny paths in get_pathspec()",
prefix_path unconditionally returns a copy, even if the passed-in path isn't
changed. Let's unconditionally free() the result.

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 builtin/update-index.c | 36 +++++++++++++++---------------------
 1 file changed, 15 insertions(+), 21 deletions(-)

diff --git a/builtin/update-index.c b/builtin/update-index.c
index d180d80..b654d27 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -274,36 +274,32 @@ static void chmod_path(int flip, const char *path)
 	die("git update-index: cannot chmod %cx '%s'", flip, path);
 }
 
-static void update_one(const char *path, const char *prefix, int prefix_length)
+static void update_one(const char *path)
 {
-	const char *p = prefix_path(prefix, prefix_length, path);
-	if (!verify_path(p)) {
+	if (!verify_path(path)) {
 		fprintf(stderr, "Ignoring path %s\n", path);
-		goto free_return;
+		return;
 	}
 	if (mark_valid_only) {
-		if (mark_ce_flags(p, CE_VALID, mark_valid_only == MARK_FLAG))
+		if (mark_ce_flags(path, CE_VALID, mark_valid_only == MARK_FLAG))
 			die("Unable to mark file %s", path);
-		goto free_return;
+		return;
 	}
 	if (mark_skip_worktree_only) {
-		if (mark_ce_flags(p, CE_SKIP_WORKTREE, mark_skip_worktree_only == MARK_FLAG))
+		if (mark_ce_flags(path, CE_SKIP_WORKTREE, mark_skip_worktree_only == MARK_FLAG))
 			die("Unable to mark file %s", path);
-		goto free_return;
+		return;
 	}
 
 	if (force_remove) {
-		if (remove_file_from_cache(p))
+		if (remove_file_from_cache(path))
 			die("git update-index: unable to remove %s", path);
 		report("remove '%s'", path);
-		goto free_return;
+		return;
 	}
-	if (process_path(p))
+	if (process_path(path))
 		die("Unable to process path %s", path);
 	report("add '%s'", path);
- free_return:
-	if (p < path || p > path + strlen(path))
-		free((char *)p);
 }
 
 static void read_index_info(int line_termination)
@@ -579,7 +575,7 @@ static int do_reupdate(int ac, const char **av,
 		 * or worse yet 'allow_replace', active_nr may decrease.
 		 */
 		save_nr = active_nr;
-		update_one(ce->name, NULL, 0);
+		update_one(ce->name);
 		if (save_nr != active_nr)
 			goto redo;
 	}
@@ -836,11 +832,10 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 
 			setup_work_tree();
 			p = prefix_path(prefix, prefix_length, path);
-			update_one(p, NULL, 0);
+			update_one(p);
 			if (set_executable_bit)
 				chmod_path(set_executable_bit, p);
-			if (p < path || p > path + strlen(path))
-				free((char *)p);
+			free(p);
 			ctx.argc--;
 			ctx.argv++;
 			break;
@@ -879,11 +874,10 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 				strbuf_swap(&buf, &nbuf);
 			}
 			p = prefix_path(prefix, prefix_length, buf.buf);
-			update_one(p, NULL, 0);
+			update_one(p);
 			if (set_executable_bit)
 				chmod_path(set_executable_bit, p);
-			if (p < buf.buf || p > buf.buf + buf.len)
-				free((char *)p);
+			free(p);
 		}
 		strbuf_release(&nbuf);
 		strbuf_release(&buf);
-- 
1.8.4.msysgit.0.12.g88f5ed0


* [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries
  2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
                   ` (12 preceding siblings ...)
  2013-11-07 14:44 ` [PATCH v4 13/14] builtin/update-index.c: cleanup update_one Karsten Blees
@ 2013-11-07 14:45 ` Karsten Blees
  2013-11-07 21:40   ` Junio C Hamano
  13 siblings, 1 reply; 26+ messages in thread
From: Karsten Blees @ 2013-11-07 14:45 UTC (permalink / raw)
  To: Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann, Karsten Blees

When cache_entry structs are removed from index_state.cache, they are not
properly freed. Freeing those entries wasn't possible before because we
couldn't remove them from index_state.name_hash.

Now that we _do_ remove the entries from name_hash, we can also free them.
Add 'free(cache_entry)' to all call sites of name-hash.c::remove_name_hash
in read-cache.c (we could free() directly in remove_name_hash(), but
name-hash.c isn't concerned with cache_entry allocation at all).

Accessing a cache_entry after removing it from the index is now no longer
allowed, as the memory has been freed. The following functions need minor
fixes (typically by copying ce->name before use):
 - builtin/rm.c::cmd_rm
 - builtin/update-index.c::do_reupdate
 - read-cache.c::read_index_unmerged
 - resolve-undo.c::unmerge_index_entry_at
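
A typical fix looks like this (a condensed sketch of the do_reupdate hunk
below):

	char *path = xstrdup(ce->name);
	update_one(path);	/* may remove 'ce' from the index, freeing it */
	free(path);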

Signed-off-by: Karsten Blees <blees@dcon.de>
---
 builtin/rm.c           | 2 +-
 builtin/update-index.c | 5 ++++-
 read-cache.c           | 8 ++++++--
 resolve-undo.c         | 7 +++++--
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/builtin/rm.c b/builtin/rm.c
index 3a0e0ea..171f37c 100644
--- a/builtin/rm.c
+++ b/builtin/rm.c
@@ -311,7 +311,7 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
 		if (!match_pathspec_depth(&pathspec, ce->name, ce_namelen(ce), 0, seen))
 			continue;
 		ALLOC_GROW(list.entry, list.nr + 1, list.alloc);
-		list.entry[list.nr].name = ce->name;
+		list.entry[list.nr].name = xstrdup(ce->name);
 		list.entry[list.nr].is_submodule = S_ISGITLINK(ce->ce_mode);
 		if (list.entry[list.nr++].is_submodule &&
 		    !is_staging_gitmodules_ok())
diff --git a/builtin/update-index.c b/builtin/update-index.c
index b654d27..acd992d 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -559,6 +559,7 @@ static int do_reupdate(int ac, const char **av,
 		const struct cache_entry *ce = active_cache[pos];
 		struct cache_entry *old = NULL;
 		int save_nr;
+		const char *path;
 
 		if (ce_stage(ce) || !ce_path_match(ce, &pathspec))
 			continue;
@@ -575,7 +576,9 @@ static int do_reupdate(int ac, const char **av,
 		 * or worse yet 'allow_replace', active_nr may decrease.
 		 */
 		save_nr = active_nr;
-		update_one(ce->name);
+		path = xstrdup(ce->name);
+		update_one(path);
+		free(path);
 		if (save_nr != active_nr)
 			goto redo;
 	}
diff --git a/read-cache.c b/read-cache.c
index 00af9ad..3f735f3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -47,6 +47,7 @@ static void replace_index_entry(struct index_state *istate, int nr, struct cache
 	struct cache_entry *old = istate->cache[nr];
 
 	remove_name_hash(istate, old);
+	free(old);
 	set_index_entry(istate, nr, ce);
 	istate->cache_changed = 1;
 }
@@ -478,6 +479,7 @@ int remove_index_entry_at(struct index_state *istate, int pos)
 
 	record_resolve_undo(istate, ce);
 	remove_name_hash(istate, ce);
+	free(ce);
 	istate->cache_changed = 1;
 	istate->cache_nr--;
 	if (pos >= istate->cache_nr)
@@ -499,8 +501,10 @@ void remove_marked_cache_entries(struct index_state *istate)
 	unsigned int i, j;
 
 	for (i = j = 0; i < istate->cache_nr; i++) {
-		if (ce_array[i]->ce_flags & CE_REMOVE)
+		if (ce_array[i]->ce_flags & CE_REMOVE) {
 			remove_name_hash(istate, ce_array[i]);
+			free(ce_array[i]);
+		}
 		else
 			ce_array[j++] = ce_array[i];
 	}
@@ -1894,7 +1898,7 @@ int read_index_unmerged(struct index_state *istate)
 		new_ce->ce_mode = ce->ce_mode;
 		if (add_index_entry(istate, new_ce, 0))
 			return error("%s: cannot drop to stage #0",
-				     ce->name);
+				     new_ce->name);
 		i = index_name_pos(istate, new_ce->name, len);
 	}
 	return unmerged;
diff --git a/resolve-undo.c b/resolve-undo.c
index c09b006..49ebaaf 100644
--- a/resolve-undo.c
+++ b/resolve-undo.c
@@ -119,6 +119,7 @@ int unmerge_index_entry_at(struct index_state *istate, int pos)
 	struct string_list_item *item;
 	struct resolve_undo_info *ru;
 	int i, err = 0, matched;
+	char *name;
 
 	if (!istate->resolve_undo)
 		return pos;
@@ -138,20 +139,22 @@ int unmerge_index_entry_at(struct index_state *istate, int pos)
 	if (!ru)
 		return pos;
 	matched = ce->ce_flags & CE_MATCHED;
+	name = xstrdup(ce->name);
 	remove_index_entry_at(istate, pos);
 	for (i = 0; i < 3; i++) {
 		struct cache_entry *nce;
 		if (!ru->mode[i])
 			continue;
 		nce = make_cache_entry(ru->mode[i], ru->sha1[i],
-				       ce->name, i + 1, 0);
+				       name, i + 1, 0);
 		if (matched)
 			nce->ce_flags |= CE_MATCHED;
 		if (add_index_entry(istate, nce, ADD_CACHE_OK_TO_ADD)) {
 			err = 1;
-			error("cannot unmerge '%s'", ce->name);
+			error("cannot unmerge '%s'", name);
 		}
 	}
+	free(name);
 	if (err)
 		return pos;
 	free(ru);
-- 
1.8.4.msysgit.0.12.g88f5ed0


* Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
  2013-11-07 14:34 ` [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal Karsten Blees
@ 2013-11-07 21:40   ` Junio C Hamano
  2013-11-08 10:27     ` Karsten Blees
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-11-07 21:40 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List, Thomas Rast, Jens Lehmann

Karsten Blees <karsten.blees@gmail.com> writes:

> +`void hashmap_add(struct hashmap *map, void *entry)`::
> +
> +	Adds a hashmap entry. This allows to add duplicate entries (i.e.
> +	separate values with the same key according to hashmap_cmp_fn).
> ++
> +`map` is the hashmap structure.
> ++
> +`entry` is the entry to add.
> +
> +`void *hashmap_put(struct hashmap *map, void *entry)`::
> +
> +	Adds or replaces a hashmap entry.
> ++
> +`map` is the hashmap structure.
> ++
> +`entry` is the entry to add or update.
> ++
> +Returns the previous entry, or NULL if not found (i.e. the entry was added).

What happens when there were duplicate entries previously added?
All are replaced?  One of them is randomly chosen and gets replaced?

With s/replaced/removed/, the same question applies to hashmap_remove().

I am guessing that "we pick one at random and pretend as if other
duplicates do not exist" is the answer, and I do not immediately
have an objection to that semantics, but whatever the rule is, it
needs to be spelled out.

> +`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`::
> +
> +	Removes a hashmap entry matching the specified key.
> ...
> +Usage example
> +-------------
> +
> +Here's a simple usage example that maps long keys to double values.
> +[source,c]
> +------------
> +struct hashmap map;
> +
> +struct long2double {
> +	struct hashmap_entry ent; /* must be the first member! */
> +	long key;
> +	double value;
> +};
> +
> +static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused)
> +{
> +	return !(e1->key == e2->key);
> +}
> +
> +void long2double_init()

void long2double_init(void)

> +{
> +	hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
> +}
> +
> +void long2double_free()

Likewise.

> diff --git a/hashmap.c b/hashmap.c
> new file mode 100644
> index 0000000..3a9f8c1
> --- /dev/null
> +++ b/hashmap.c
> ...
> +unsigned int memhash(const void *buf, size_t len)
> +{
> +	unsigned int hash = FNV32_BASE;
> +	unsigned char *ucbuf = (unsigned char*) buf;

"(unsigned char *)buf"; pointer-asterisk does not stick to type.

> +	while (len--) {
> +		unsigned int c = *ucbuf++;
> +		hash = (hash * FNV32_PRIME) ^ c;
> +	}
> +	return hash;
> +}
> +
> +unsigned int memihash(const void *buf, size_t len)
> +{
> +	unsigned int hash = FNV32_BASE;
> +	unsigned char *ucbuf = (unsigned char*) buf;
> +	while (len--) {
> +		unsigned int c = *ucbuf++;
> +		if (c >= 'a' && c <= 'z')
> +			c -= 'a' - 'A';
> +		hash = (hash * FNV32_PRIME) ^ c;
> +	}
> +	return hash;
> +}
> +
> +#define HASHMAP_INITIAL_SIZE 64
> +/* grow / shrink by 2^2 */
> +#define HASHMAP_GROW 2
> +/* grow if > 80% full (to 20%) */
> +#define HASHMAP_GROW_AT 1.25
> +/* shrink if < 16.6% full (to 66.6%) */
> +#define HASHMAP_SHRINK_AT 6

Maybe I am too old-fashioned, but seeing a floating-point constant
in our otherwise pretty-much integer-only code makes me feel uneasy.
"gcc -S -O2" does seem to generate a floating-point instruction for
"i" but not for "j":

    extern void inspect(unsigned int i, unsigned int j);

    void grow(unsigned int i, unsigned int j)
    {
            i *= 1.25;
            j += j >> 2;
            inspect(i, j);
    }

Perhaps

#define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
#define HASHMAP_SHRINK_AT(current) ((current) * 6)
#define HASHMAP_GROW(current) ((current) << 2)
#define HASHMAP_SHRINK(current) ((current) >> 2)

may alleviate my worries; I dunno.

> +
> +static inline int entry_equals(const struct hashmap *map,
> +		const struct hashmap_entry *e1, const struct hashmap_entry *e2,
> +		const void *keydata)
> +{
> +	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, keydata));

Our code tends not to say "this is a pointer to a function"
explicitly, i.e.

	!map->cmpfn(e1, e2, keydata)

is more in-line with our code and should also be sufficient.

> +}
> +
> +static inline unsigned int bucket(const struct hashmap *map,
> +		const struct hashmap_entry *key)
> +{
> +	return key->hash & (map->tablesize - 1);
> +}
> +
> +static void rehash(struct hashmap *map, unsigned int newsize)
> +{
> +	unsigned int i, oldsize = map->tablesize;
> +	struct hashmap_entry **oldtable = map->table;
> +
> +	map->tablesize = newsize;
> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);

Even though multiplication is commutative, please match the official
function signature, i.e.

	calloc(size_t nmemb, size_t size)

where the number of elements comes first and size of an element
comes next.  I.e.

	xcalloc(map->tablesize, sizeof(struct hashmap_entry *));

Also don't forget the SP between type and asterisk.

> +	for (i = 0; i < oldsize; i++) {
> +		struct hashmap_entry *e = oldtable[i];
> +		while (e) {
> +			struct hashmap_entry *next = e->next;
> +			unsigned int b = bucket(map, e);
> +			e->next = map->table[b];
> +			map->table[b] = e;
> +			e = next;
> +		}
> +	}
> +	free(oldtable);
> +}
> +
> +static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
> +		const struct hashmap_entry *key, const void *keydata)
> +{
> +	struct hashmap_entry **e = &map->table[bucket(map, key)];
> +	while (*e && !entry_equals(map, *e, key, keydata))
> +		e = &(*e)->next;
> +	return e;
> +}
> +
> +static int always_equal(const void *unused1, const void *unused2, const void *unused3)
> +{
> +	return 0;
> +}
> +
> +void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
> +		size_t initial_size)
> +{
> +	map->size = 0;
> +	map->cmpfn = equals_function ? equals_function : always_equal;
> +	/* calculate initial table size and allocate the table */
> +	map->tablesize = HASHMAP_INITIAL_SIZE;
> +	initial_size *= HASHMAP_GROW_AT;
> +	while (initial_size > map->tablesize)
> +		map->tablesize <<= HASHMAP_GROW;
> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
> +}
> +
> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
> +{

Why is free_function not part of the constants defined at
hashmap_init() time?  Your API allows the same hashmap, depending on
the way it has been used, to be cleaned up with different
free_function, but I am not sure if that "flexibility" is intended
(and in what application it would be useful).

Just like a NULL passed for equals_function gets the internal
always_equal fallback function, if you make this a part of
hashmap_init(), a NULL passed for free_function can be used as a
signal to use the system's free(3) and you no longer have to say
"free from stdlib" in the docs and the comment.

> +	if (!map || !map->table)
> +		return;
> +	if (free_function) {
> +		struct hashmap_iter iter;
> +		struct hashmap_entry *e;
> +		hashmap_iter_init(map, &iter);
> +		while ((e = hashmap_iter_next(&iter)))
> +			(*free_function)(e);
> +	}
> +	free(map->table);
> +	memset(map, 0, sizeof(*map));
> +}
> +
> +void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)
> +{
> +	return *find_entry_ptr(map, key, keydata);
> +}
> +
> +void *hashmap_get_next(const struct hashmap *map, const void *entry)
> +{
> +	struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
> +	for (; e; e = e->next)
> +		if (entry_equals(map, entry, e, NULL))
> +			return e;
> +	return NULL;
> +}
> +
> +void hashmap_add(struct hashmap *map, void *entry)
> +{
> +	unsigned int b = bucket(map, entry);
> +
> +	/* add entry */
> +	((struct hashmap_entry*) entry)->next = map->table[b];
> +	map->table[b] = entry;
> +
> +	/* fix size and rehash if appropriate */
> +	map->size++;
> +	if (map->size * HASHMAP_GROW_AT > map->tablesize)
> +		rehash(map, map->tablesize << HASHMAP_GROW);
> +}
> +
> +void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)
> +{
> +	struct hashmap_entry *old;
> +	struct hashmap_entry **e = find_entry_ptr(map, key, keydata);
> +	if (!*e)
> +		return NULL;
> +
> +	/* remove existing entry */
> +	old = *e;
> +	*e = old->next;
> +	old->next = NULL;
> +
> +	/* fix size and rehash if appropriate */
> +	map->size--;
> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
> +	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
> +		rehash(map, map->tablesize >> HASHMAP_GROW);

This "we shrink by the same amount" looks inconsistent with the use
of separate grow-at and shrink-at constants (see above for four
suggested #define's).

Thanks.


* Re: [PATCH v4 13/14] builtin/update-index.c: cleanup update_one
  2013-11-07 14:44 ` [PATCH v4 13/14] builtin/update-index.c: cleanup update_one Karsten Blees
@ 2013-11-07 21:40   ` Junio C Hamano
  2013-11-08 10:27     ` Karsten Blees
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-11-07 21:40 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List, Thomas Rast, Jens Lehmann

Karsten Blees <karsten.blees@gmail.com> writes:

> -			if (p < path || p > path + strlen(path))
> -				free((char *)p);
> +			free(p);

The non-const cast was there for a reason.  I'll locally fix it up
(there is another instance of the same) to get it compile before
queuing it on 'pu'.

Thanks.


* Re: [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries
  2013-11-07 14:45 ` [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries Karsten Blees
@ 2013-11-07 21:40   ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2013-11-07 21:40 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List, Thomas Rast, Jens Lehmann

Karsten Blees <karsten.blees@gmail.com> writes:

> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index b654d27..acd992d 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -559,6 +559,7 @@ static int do_reupdate(int ac, const char **av,
>  		const struct cache_entry *ce = active_cache[pos];
>  		struct cache_entry *old = NULL;
>  		int save_nr;
> +		const char *path;
>  
>  		if (ce_stage(ce) || !ce_path_match(ce, &pathspec))
>  			continue;
> @@ -575,7 +576,9 @@ static int do_reupdate(int ac, const char **av,
>  		 * or worse yet 'allow_replace', active_nr may decrease.
>  		 */
>  		save_nr = active_nr;
> -		update_one(ce->name);
> +		path = xstrdup(ce->name);
> +		update_one(path);
> +		free(path);
>  		if (save_nr != active_nr)
>  			goto redo;
>  	}

This also gets a complaint from free(), which does not want to free a
const pointer.  I'll fix it up locally to get it to compile before
queuing it to 'pu'.
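
One possible shape of that fix-up, declaring the duplicated path as
non-const so free() takes it without a cast (a sketch; the queued
version may differ):

		char *path = xstrdup(ce->name);
		update_one(path);
		free(path);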

Thanks.


* Re: [PATCH v4 12/14] fix 'git update-index --verbose --again' output
  2013-11-07 14:43 ` [PATCH v4 12/14] fix 'git update-index --verbose --again' output Karsten Blees
@ 2013-11-07 22:12   ` Junio C Hamano
  2013-11-08 10:27     ` Karsten Blees
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-11-07 22:12 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List, Thomas Rast, Jens Lehmann

Karsten Blees <karsten.blees@gmail.com> writes:

> 'git update-index --verbose' consistently reports paths relative to the
> work-tree root. The only exception is the '--again' option, which reports
> paths relative to the current working directory.
>
> Change do_reupdate to use non-prefixed paths.

Interesting.

This looks like a genuine fix unrelated to the use of the new hashmap.

>
> Signed-off-by: Karsten Blees <blees@dcon.de>
> ---
>  builtin/update-index.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index e3a10d7..d180d80 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -579,7 +579,7 @@ static int do_reupdate(int ac, const char **av,
>  		 * or worse yet 'allow_replace', active_nr may decrease.
>  		 */
>  		save_nr = active_nr;
> -		update_one(ce->name + prefix_length, prefix, prefix_length);
> +		update_one(ce->name, NULL, 0);
>  		if (save_nr != active_nr)
>  			goto redo;
>  	}


* Re: [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it
  2013-11-07 14:33 ` [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it Karsten Blees
@ 2013-11-07 22:27   ` Heiko Voigt
  0 siblings, 0 replies; 26+ messages in thread
From: Heiko Voigt @ 2013-11-07 22:27 UTC (permalink / raw)
  To: Karsten Blees, Git List, Junio C Hamano; +Cc: Thomas Rast, Jens Lehmann

Hi,

it looks like there is a "From: Jens ..." line missing on top of this patch.

Am 07.11.2013 15:33, schrieb Karsten Blees:
> Commit 5fee995244e introduced the stage_updated_gitmodules() function to
> add submodule configuration updates to the index. It assumed that even
> after calling remove_cache_entry_at() the same cache entry would still be
> valid. This was true in the old days, as cache entries could never be
> freed, but that is not so sure in the present as there is ongoing work to
> free removed cache entries, which makes this code segfault.
>
> Fix that by calling add_file_to_cache() instead of open coding it. Also
> remove the "could not find .gitmodules in index" warning, as that won't
> happen in regular use cases (and by then just silently adding it to the
> index we do the right thing).
>
> Thanks-to: Karsten Blees <karsten.blees@gmail.com>
> Signed-off-by: Jens Lehmann <Jens.Lehmann@web.de>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Cheers Heiko


* Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
  2013-11-07 21:40   ` Junio C Hamano
@ 2013-11-08 10:27     ` Karsten Blees
  2013-11-08 16:45       ` Philip Oakley
  2013-11-08 17:08       ` Junio C Hamano
  0 siblings, 2 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-08 10:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Thomas Rast, Jens Lehmann

Am 07.11.2013 22:40, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> +`void hashmap_add(struct hashmap *map, void *entry)`::
>> +
>> +	Adds a hashmap entry. This allows adding duplicate entries (i.e.
>> +	separate values with the same key according to hashmap_cmp_fn).
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add.
>> +
>> +`void *hashmap_put(struct hashmap *map, void *entry)`::
>> +
>> +	Adds or replaces a hashmap entry.
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add or update.
>> ++
>> +Returns the previous entry, or NULL if not found (i.e. the entry was added).
> 
> What happens when there were duplicate entries previously added?
> All are replaced?  One of them is randomly chosen and gets replaced?
> 
> With s/replaced/removed/, the same question applies to hashmap_remove().
> 
> I am guessing that "we pick one at random and pretend as if other
> duplicates do not exist" is the answer, 

Exactly, however, you won't have duplicates if you use put exclusively.
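
To illustrate with the long2double example from the docs (xmalloc, the
direct assignment of the hash field, and the second map 'map2' are
shorthands for this sketch):

static struct long2double *new_long2double(long key, double value)
{
	struct long2double *e = xmalloc(sizeof(*e));
	e->ent.hash = memhash(&key, sizeof(key));
	e->ent.next = NULL;
	e->key = key;
	e->value = value;
	return e;
}

/* add: a second entry for key 42 is chained alongside the first (a duplicate) */
hashmap_add(&map, new_long2double(42, 1.0));
hashmap_add(&map, new_long2double(42, 2.0));

/* put (used exclusively, on a fresh map) replaces: the second call returns
 * the displaced entry, so at most one entry per key ever exists */
free(hashmap_put(&map2, new_long2double(42, 1.0)));	/* returns NULL */
free(hashmap_put(&map2, new_long2double(42, 2.0)));	/* returns the first entry */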

> and I do not immediately
> have an objection to that semantics, but whatever the rule is, it
> needs to be spelled out.
> 

I'll clarify this.

>> +`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`::
>> +
>> +	Removes a hashmap entry matching the specified key.
>> ...
>> +Usage example
>> +-------------
>> +
>> +Here's a simple usage example that maps long keys to double values.
>> +[source,c]
>> +------------
>> +struct hashmap map;
>> +
>> +struct long2double {
>> +	struct hashmap_entry ent; /* must be the first member! */
>> +	long key;
>> +	double value;
>> +};
>> +
>> +static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused)
>> +{
>> +	return !(e1->key == e2->key);
>> +}
>> +
>> +void long2double_init()
> 
> void long2double_init(void)
> 
>> +{
>> +	hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
>> +}
>> +
>> +void long2double_free()
> 
> Likewise.
> 
>> diff --git a/hashmap.c b/hashmap.c
>> new file mode 100644
>> index 0000000..3a9f8c1
>> --- /dev/null
>> +++ b/hashmap.c
>> ...
>> +unsigned int memhash(const void *buf, size_t len)
>> +{
>> +	unsigned int hash = FNV32_BASE;
>> +	unsigned char *ucbuf = (unsigned char*) buf;
> 
> "(unsigned char *)buf"; pointer-asterisk does not stick to type.
> 

Ok, I'll recheck all casts.

>> +	while (len--) {
>> +		unsigned int c = *ucbuf++;
>> +		hash = (hash * FNV32_PRIME) ^ c;
>> +	}
>> +	return hash;
>> +}
>> +
>> +unsigned int memihash(const void *buf, size_t len)
>> +{
>> +	unsigned int hash = FNV32_BASE;
>> +	unsigned char *ucbuf = (unsigned char*) buf;
>> +	while (len--) {
>> +		unsigned int c = *ucbuf++;
>> +		if (c >= 'a' && c <= 'z')
>> +			c -= 'a' - 'A';
>> +		hash = (hash * FNV32_PRIME) ^ c;
>> +	}
>> +	return hash;
>> +}
>> +
>> +#define HASHMAP_INITIAL_SIZE 64
>> +/* grow / shrink by 2^2 */
>> +#define HASHMAP_GROW 2
>> +/* grow if > 80% full (to 20%) */
>> +#define HASHMAP_GROW_AT 1.25
>> +/* shrink if < 16.6% full (to 66.6%) */
>> +#define HASHMAP_SHRINK_AT 6
> 
> Maybe I am too old-fashioned, but seeing a floating-point constant
> in our otherwise pretty-much integer-only code makes me feel uneasy.
> "gcc -S -O2" does seem to generate floating point instructions for
> "i" but not for "j":
> 
>     extern void inspect(unsigned int i, unsigned int j);
> 
>     void grow(unsigned int i, unsigned int j)
>     {
>             i *= 1.25;
>             j += j >> 2;
>             inspect(i, j);
>     }
> 

I guess there are no more i486SXs around, so using floating point should not be a problem (at least performance-wise...).

However, defining the constants inversely is a bit unintuitive (i.e. 1.25 instead of 0.8, 6 instead of 0.166). Perhaps the thresholds should also be calculated once on resize, not on every add / remove.

What about this:

#define HASHMAP_GROW_AT 80
#define HASHMAP_SHRINK_AT 16

...in rehash:

map->grow_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_GROW_AT / 100);
map->shrink_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_SHRINK_AT / 100);

...in add:

map->size++;
if (map->size > map->grow_at)

...in remove:

map->size--;
if (map->size < map->shrink_at)

> Perhaps
> 
> #define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
> #define HASHMAP_GROW(current) ((current) << 2)
> #define HASHMAP_SHRINK(current) ((current) >> 2)
> 
> may alleviate my worries; I dunno.
> 
>> +
>> +static inline int entry_equals(const struct hashmap *map,
>> +		const struct hashmap_entry *e1, const struct hashmap_entry *e2,
>> +		const void *keydata)
>> +{
>> +	return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, keydata));
> 
> Our code tends not to say "this is a pointer to a function"
> explicitly, i.e.
> 
> 	!map->cmpfn(e1, e2, keydata)
> 
> is more in-line with our code and should also be sufficient.
> 
>> +}
>> +
>> +static inline unsigned int bucket(const struct hashmap *map,
>> +		const struct hashmap_entry *key)
>> +{
>> +	return key->hash & (map->tablesize - 1);
>> +}
>> +
>> +static void rehash(struct hashmap *map, unsigned int newsize)
>> +{
>> +	unsigned int i, oldsize = map->tablesize;
>> +	struct hashmap_entry **oldtable = map->table;
>> +
>> +	map->tablesize = newsize;
>> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
> 
> Even though multiplication is commutative, please match the official
> function signature, i.e.
> 
> 	calloc(size_t nmemb, size_t size)
> 
> where the number of elements comes first and size of an element
> comes next.  I.e.
> 
> 	xcalloc(map->tablesize, sizeof(struct hashmap_entry *));
> 
> Also don't forget the SP between type and asterisk.
> 
>> +	for (i = 0; i < oldsize; i++) {
>> +		struct hashmap_entry *e = oldtable[i];
>> +		while (e) {
>> +			struct hashmap_entry *next = e->next;
>> +			unsigned int b = bucket(map, e);
>> +			e->next = map->table[b];
>> +			map->table[b] = e;
>> +			e = next;
>> +		}
>> +	}
>> +	free(oldtable);
>> +}
>> +
>> +static inline struct hashmap_entry **find_entry_ptr(const struct hashmap *map,
>> +		const struct hashmap_entry *key, const void *keydata)
>> +{
>> +	struct hashmap_entry **e = &map->table[bucket(map, key)];
>> +	while (*e && !entry_equals(map, *e, key, keydata))
>> +		e = &(*e)->next;
>> +	return e;
>> +}
>> +
>> +static int always_equal(const void *unused1, const void *unused2, const void *unused3)
>> +{
>> +	return 0;
>> +}
>> +
>> +void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function,
>> +		size_t initial_size)
>> +{
>> +	map->size = 0;
>> +	map->cmpfn = equals_function ? equals_function : always_equal;
>> +	/* calculate initial table size and allocate the table */
>> +	map->tablesize = HASHMAP_INITIAL_SIZE;
>> +	initial_size *= HASHMAP_GROW_AT;
>> +	while (initial_size > map->tablesize)
>> +		map->tablesize <<= HASHMAP_GROW;
>> +	map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);
>> +}
>> +
>> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
>> +{
> 
> Why is free_function not part of the constants defined at
> hashmap_init() time?  Your API allows the same hashmap, depending on
> the way it has been used, to be cleaned up with different
> free_function, but I am not sure if that "flexibility" is intended
> (and in what application it would be useful).
> 

The free_function is a convenience so you don't have to loop over the entries yourself. It is only needed by hashmap_free (e.g. remove() and put() return the entry without freeing it), so I don't see a reason to carry the free_function around from construction time.

Git uses overallocated structs a lot (i.e. ending in char name[FLEX_ARRAY]), so stdlib free is all we need so far. If the entries have char *name1; char *name2; which are individually allocated, you need a special free function. But perhaps this is a case of premature generalization and a simple 'to free or not to free' boolean would suffice.
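
To make the two cases concrete (struct and function names are made up for
this sketch; the free callback is assumed to take a void pointer, matching
the plain 'free' passed further down):

/* overallocated entry: one allocation, stdlib free() is enough */
struct flat_entry {
	struct hashmap_entry ent;
	unsigned int flags;
	char name[FLEX_ARRAY];	/* allocated together with the struct */
};

/* entry with separately allocated members: needs a custom free function */
struct two_names {
	struct hashmap_entry ent;
	char *name1;
	char *name2;
};

static void two_names_free(void *entry)
{
	struct two_names *e = entry;
	free(e->name1);
	free(e->name2);
	free(e);
}

...
hashmap_free(&flat_map, free);
hashmap_free(&two_names_map, two_names_free);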

> Just like a NULL passed for equals_function gets the internal
> always_equal fallback function, if you make this a part of
> hashmap_init(), a NULL passed for free_function can be used as a
> signal to use the system's free(3) and you no longer have to say
> "free from stdlib" in the docs and the comment.
> 

No, there are cases where the entries are not owned by the hashmap, so passing NULL = 'don't free entries' is a valid case. See name-hash.c:

	hashmap_free(&istate->name_hash, NULL);
	hashmap_free(&istate->dir_hash, free);

The cache_entries are owned by index_state.cache, while the dir_entries are our own.

>> +	if (!map || !map->table)
>> +		return;
>> +	if (free_function) {
>> +		struct hashmap_iter iter;
>> +		struct hashmap_entry *e;
>> +		hashmap_iter_init(map, &iter);
>> +		while ((e = hashmap_iter_next(&iter)))
>> +			(*free_function)(e);
>> +	}
>> +	free(map->table);
>> +	memset(map, 0, sizeof(*map));
>> +}
>> +
>> +void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)
>> +{
>> +	return *find_entry_ptr(map, key, keydata);
>> +}
>> +
>> +void *hashmap_get_next(const struct hashmap *map, const void *entry)
>> +{
>> +	struct hashmap_entry *e = ((struct hashmap_entry*) entry)->next;
>> +	for (; e; e = e->next)
>> +		if (entry_equals(map, entry, e, NULL))
>> +			return e;
>> +	return NULL;
>> +}
>> +
>> +void hashmap_add(struct hashmap *map, void *entry)
>> +{
>> +	unsigned int b = bucket(map, entry);
>> +
>> +	/* add entry */
>> +	((struct hashmap_entry*) entry)->next = map->table[b];
>> +	map->table[b] = entry;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size++;
>> +	if (map->size * HASHMAP_GROW_AT > map->tablesize)
>> +		rehash(map, map->tablesize << HASHMAP_GROW);
>> +}
>> +
>> +void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)
>> +{
>> +	struct hashmap_entry *old;
>> +	struct hashmap_entry **e = find_entry_ptr(map, key, keydata);
>> +	if (!*e)
>> +		return NULL;
>> +
>> +	/* remove existing entry */
>> +	old = *e;
>> +	*e = old->next;
>> +	old->next = NULL;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size--;
>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>> +	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
> 
> This "we shrink by the same amount" looks inconsistent with the use
> of separate grow-at and shrink-at constants (see above for four
> suggested #define's).
> 

These values account for a small hysteresis, so that there is no size at which a sequence of add, remove, add, remove (or put, put, put, put) triggers a resize on every operation.

We grow when the table is more than 80% full (and quadrupling it leaves it ~20% full), and shrink when it is less than 16.6% full (and quartering it leaves it ~66.6% full).
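
To put numbers on it (a worked example with the posted constants, starting
from a 256-bucket table):

  grow:   size * 1.25 > 256   first true at size 205 -> grow to 1024 buckets (~20% full)
  shrink: size * 6 < 1024     first true at size 170 -> shrink back to 256  (~66% full)

So right after growing at 205 entries it takes 35 removals to hit the
shrink threshold, and right after shrinking at 170 entries it takes 35
additions to hit the grow threshold - a single add/remove pair can never
bounce between the two sizes.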


* Re: [PATCH v4 12/14] fix 'git update-index --verbose --again' output
  2013-11-07 22:12   ` Junio C Hamano
@ 2013-11-08 10:27     ` Karsten Blees
  0 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-08 10:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Thomas Rast, Jens Lehmann

Am 07.11.2013 23:12, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> 'git update-index --verbose' consistently reports paths relative to the
>> work-tree root. The only exception is the '--again' option, which reports
>> paths relative to the current working directory.
>>
>> Change do_reupdate to use non-prefixed paths.
> 
> Interesting.
> 
> This looks like a genuine fix unrelated to the use of the new hashmap.
> 

Indeed, #13 as well (and #4, #5, for that matter). I stumbled across this when analysing Thomas' last valgrind report.

Note that #12, #13 are prequels to #14, which adds a "char *path = xstrdup(ce->name); ... free(path)" around the update_one call.

>>
>> Signed-off-by: Karsten Blees <blees@dcon.de>
>> ---
>>  builtin/update-index.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/builtin/update-index.c b/builtin/update-index.c
>> index e3a10d7..d180d80 100644
>> --- a/builtin/update-index.c
>> +++ b/builtin/update-index.c
>> @@ -579,7 +579,7 @@ static int do_reupdate(int ac, const char **av,
>>  		 * or worse yet 'allow_replace', active_nr may decrease.
>>  		 */
>>  		save_nr = active_nr;
>> -		update_one(ce->name + prefix_length, prefix, prefix_length);
>> +		update_one(ce->name, NULL, 0);
>>  		if (save_nr != active_nr)
>>  			goto redo;
>>  	}


* Re: [PATCH v4 13/14] builtin/update-index.c: cleanup update_one
  2013-11-07 21:40   ` Junio C Hamano
@ 2013-11-08 10:27     ` Karsten Blees
  0 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-08 10:27 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Thomas Rast, Jens Lehmann

Am 07.11.2013 22:40, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> -			if (p < path || p > path + strlen(path))
>> -				free((char *)p);
>> +			free(p);
> 
> The non-const cast was there for a reason.  I'll locally fix it up
> (there is another instance of the same) to get it to compile before
> queuing it on 'pu'.
> 
> Thanks.
> 

Seems I spent so much time on code analysis and valgrind results that I missed the compiler warnings... didn't we have -Werror once?


* Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
  2013-11-08 10:27     ` Karsten Blees
@ 2013-11-08 16:45       ` Philip Oakley
  2013-11-08 17:08       ` Junio C Hamano
  1 sibling, 0 replies; 26+ messages in thread
From: Philip Oakley @ 2013-11-08 16:45 UTC (permalink / raw)
  To: Karsten Blees, Junio C Hamano; +Cc: Git List, Thomas Rast, Jens Lehmann

From: "Karsten Blees" <karsten.blees@gmail.com>

> However, defining the constants inversely is a bit unintuitive (i.e. 
> 1.25 instead of 0.8, 6 instead of 0.166). Perhaps the thresholds 
> should also be calculated once on resize, not on every add / remove.
>
> What about this:
>
> #define HASHMAP_GROW_AT 80
> #define HASHMAP_SHRINK_AT 16

micro-bikeshed: adding a code comment...

 #define HASHMAP_GROW_AT 80    /* percent */
 #define HASHMAP_SHRINK_AT 16    /* percent */

>
> ...in rehash:
>
> map->grow_at = (unsigned int)((uint64_t) map->tablesize * 
> HASHMAP_GROW_AT / 100);
> map->shrink_at = (unsigned int)((uint64_t) map->tablesize * 
> HASHMAP_SHRINK_AT / 100);
>

[No need to add to cc...]

regards

Philip 


* Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
  2013-11-08 10:27     ` Karsten Blees
  2013-11-08 16:45       ` Philip Oakley
@ 2013-11-08 17:08       ` Junio C Hamano
  2013-11-13 16:37         ` Karsten Blees
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2013-11-08 17:08 UTC (permalink / raw)
  To: Karsten Blees; +Cc: Git List, Thomas Rast, Jens Lehmann

Karsten Blees <karsten.blees@gmail.com> writes:

> What about this:
>
> #define HASHMAP_GROW_AT 80
> #define HASHMAP_SHRINK_AT 16

I am not too enthused for three reasons. The fact that these are
100-based numbers is not written down anywhere other than the place
they are used, the places that use these need to consistently divide
by 100, which invites unnecessary bugs, and compared to the
original, you now require 16/100 but you didn't even want the exact
16% in the first place (i.e. a simple 1/6 was good enough, and it
still is).

>> Perhaps
>> 
>> #define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
>> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
>> #define HASHMAP_GROW(current) ((current) << 2)
>> #define HASHMAP_SHRINK(current) ((current) >> 2)
>> 
>> may alleviate my worries; I dunno.

>>> +
>>> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
>>> +{
>> 
>> Why is free_function not part of the constants defined at
>> hashmap_init() time?  Your API allows the same hashmap, depending on
>> the way it has been used, to be cleaned up with different
>> free_function, but I am not sure if that "flexibility" is intended
>> (and in what application it would be useful).
>> 
>
> The free_function is a convenience so you don't have to loop over
> the entries yourself. ...
> ...a simple 'to free or not to free' boolean would suffice.

That is not the "flexibility" I was talking about. Your API allows
one to write a single program that does this:

	struct hashmap map;

	hashmap_init(&map, compare_fn);
        add/put/remove on map;

	if (phase_of_moon())
        	hashmap_free(&map, free_them_in_one_way);
	else
        	hashmap_free(&map, free_them_in_another_way);

Just like your _init takes a comparison function to make it clear
that all entries will be compared using the same function throughout
the life of the map, if it takes a free function (and you can use
NULL to mean "do not free, I am managing elements myself"), I would
think that it will make it clear that the elements in that map will
be freed the same way.

And it will allow your _put to call that free function when you
replace an existing entry with a new one, if that becomes necessary.
The API in the posted version seems to make it responsibility of the
caller of _put to do whatever necessary clean-up to the returned
value (which is the entry that was replaced and no longer in the
hashmap), but within the context of a patch series whose later patch
changes the API to replace or remove an entry from the index in such
a way to shift the responsibility of freeing it from the caller to
the callee, such a change to this API to make _put and _remove
responsible for calling per-element free is a possibility you may
want to consider, no?
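
To illustrate what I mean on the caller side, a hypothetical helper on
top of your posted API (with free_fn standing in for the per-map member
that _init would record):

	static void hashmap_put_free(struct hashmap *map, void *entry,
			hashmap_free_fn free_fn)
	{
		void *old = hashmap_put(map, entry);
		if (old && free_fn)
			free_fn(old); /* callee, not caller, frees the displaced entry */
	}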

>>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>>> +	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
>>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
>> 
>> This "we shrink by the same amount" looks inconsistent with the use
>> of separate grow-at and shrink-at constants (see above for four
>> suggested #define's).
>> 
>
> These values account for a small hysteresis so that there is no size at which a sequence of add, remove, add, remove (or put, put, put, put) results in permanent resizes.

I was commenting on the two bottom lines of the above three line
quote from the patch.  You use SHRINK_AT to decide if you want to
shrink, and you use >>GROW to do the actual shrinking.  Why isn't it
like this instead?

	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
	    HASHMAP_SHRINK_AT(map->size) < map->tablesize)
		rehash(map, map->tablesize >> HASHMAP_SHRINK);

The fact that constant used for shrinking was not called SHRINK but
GROW was what caught my attention.


* Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal
  2013-11-08 17:08       ` Junio C Hamano
@ 2013-11-13 16:37         ` Karsten Blees
  0 siblings, 0 replies; 26+ messages in thread
From: Karsten Blees @ 2013-11-13 16:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Thomas Rast, Jens Lehmann

Am 08.11.2013 18:08, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@gmail.com> writes:
> 
>> What about this:
>>
>> #define HASHMAP_GROW_AT 80
>> #define HASHMAP_SHRINK_AT 16
> 
> I am not too enthused for three reasons. The fact that these are
> 100-based numbers is not written down anywhere other than the place
> they are used, 

Please forgive me for not properly commenting those lines when I tried to express my ideas in a mail :-)

> the places that use these need to consistently divide
> by 100, which invites unnecessary bugs, and compared to the
> original, you now require 16/100 but you didn't even want the exact
> 16% in the first place (i.e. a simple 1/6 was good enough, and it
> still is).
> 

Actually, we're looking for a value slightly smaller than load-factor / resize-factor (i.e. 0.8 / 4 = 0.2), and 1/6 just happens to be close. However, if we modify the load-factor or resize-factor, we also need to adjust the shrink threshold in some non-obvious way. E.g. with load-factor 0.6, shrink-at must be 1/7. If we grow / shrink by 3 bits, shrink-at must be 1/11...

My current work-in-progress version eliminates the _SHRINK_AT constant completely by basing the calculation on load-factor and resize-factor alone (i.e. it works for all values of load-factor and resize-factor without danger of introducing bugs):

/* grow / shrink by 2^2 */
#define HASHMAP_RESIZE_BITS 2
/* load factor in percent */
#define HASHMAP_LOAD_FACTOR 80

static void alloc_table(struct hashmap *map, unsigned int size)
{
	map->tablesize = size;
	map->table = xcalloc(size, sizeof(struct hashmap_entry *));

	/* calculate resize thresholds for new size */
	map->grow_at = (unsigned int) ((uint64_t) size * HASHMAP_LOAD_FACTOR / 100);
	if (size <= HASHMAP_INITIAL_SIZE)
		map->shrink_at = 0;
	else
		/*
		 * The shrink-threshold must be slightly smaller than
		 * (grow-threshold / resize-factor) to prevent erratic resizing,
		 * thus we divide by (resize-factor + 1).
		 */
		map->shrink_at = map->grow_at / ((1 << HASHMAP_RESIZE_BITS) + 1);
}
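
Plugging in numbers as a sanity check (assuming the grow check is
"size > grow_at" and the shrink check "size < shrink_at", as in the
earlier sketch):

  tablesize 1024:  grow_at = 1024 * 80 / 100 = 819   shrink_at = 819 / 5 = 163
  tablesize 4096:  grow_at = 3276                    shrink_at = 655
  tablesize  256:  grow_at = 204                     shrink_at = 40

Growing at 820 entries (into the 4096 table) leaves the map well above
the new shrink_at of 655, and shrinking at 162 entries (into the 256
table) leaves it well below the new grow_at of 204, so one resize can
never immediately trigger the opposite one.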


>>> Perhaps
>>>
>>> #define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
>>> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
>>> #define HASHMAP_GROW(current) ((current) << 2)
>>> #define HASHMAP_SHRINK(current) ((current) >> 2)
>>>
>>> may alleviate my worries; I dunno.
> 
>>>> +
>>>> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
>>>> +{
>>>
>>> Why is free_function not part of the constants defined at
>>> hashmap_init() time?  Your API allows the same hashmap, depending on
>>> the way it has been used, to be cleaned up with different
>>> free_function, but I am not sure if that "flexibility" is intended
>>> (and in what application it would be useful).
>>>
>>
>> The free_function is a convenience so you don't have to loop over
>> the entries yourself. ...
>> ...a simple 'to free or not to free' boolean would suffice.
> 
> That is not the "flexibility" I was talking about. Your API allows
> one to write a single program that does this:
> 
> 	struct hashmap map;
> 
> 	hashmap_init(&map, compare_fn);
>         add/put/remove on map;
> 
> 	if (phase_of_moon())
>         	hashmap_free(&map, free_them_in_one_way);
> 	else
>         	hashmap_free(&map, free_them_in_another_way);
> 
> Just like your _init takes a comparison function to make it clear
> that all entries will be compared using the same function throughout
> the life of the map, if it takes a free function (and you can use
> NULL to mean "do not free, I am managing elements myself"), I would
> think that it will make it clear that the elements in that map will
> be freed the same way.
> 
> And it will allow your _put to call that free function when you
> replace an existing entry with a new one, if that becomes necessary.
> The API in the posted version seems to make it the responsibility of the
> caller of _put to do whatever necessary clean-up to the returned
> value (which is the entry that was replaced and no longer in the
> hashmap), but within the context of a patch series whose later patch
> changes the API to replace or remove an entry from the index in such
> a way to shift the responsibility of freeing it from the caller to
> the callee, such a change to this API to make _put and _remove
> responsible for calling per-element free is a possibility you may
> want to consider, no?
> 

Using an entry after it has been removed is a pretty common use case. E.g. a multi-threaded application might want to remove the entry within a mutex and postpone any more expensive cleanup (e.g. updating a file, or even freeing the entry) until after leaving the mutex; a merge algorithm might want to move entries from source maps to a target map. If _remove/_put were also responsible for freeing the entry, you'd have to _get + copy + _remove/_put instead, which is rather complicated and certainly doesn't improve performance.
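
For the first case, a minimal sketch (the mutex and the long2double entry
type from the API docs are purely illustrative, and filling in the hash
field of the lookup key directly is a shorthand for this example):

#include <pthread.h>

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;

static void discard_key(struct hashmap *map, long key)
{
	struct long2double k, *e;

	k.key = key;
	k.ent.hash = memhash(&key, sizeof(key));

	pthread_mutex_lock(&map_lock);
	e = hashmap_remove(map, &k, NULL); /* O(1); entry is returned, not freed */
	pthread_mutex_unlock(&map_lock);

	/* expensive cleanup (or just the free) happens outside the mutex */
	free(e);
}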

From looking through some Java sources (with a similar Map.remove API), I get the impression that use-after-remove would also have to be a per-call rather than a per-map decision.

With the current API, removing _and_ freeing an entry is as simple as:

  free(hashmap_remove(...));

So I don't see the need for additional hashmap_remove_and_free() or hashmap_remove(..., int free_entry) APIs.

It's different for hashmap_free, though: no matter how your algorithm worked during the lifetime of the map, you still might want to free remaining entries when cleaning up memory. We could of course remove this feature from hashmap_free to make it absolutely clear that memory management is the responsibility of the caller. Then we'd need additional code whenever freeing entries is necessary:

  struct hashmap_iter iter;
  struct hashmap_entry *e;
  for (e = hashmap_iter_first(map, &iter); e; e = hashmap_iter_next(&iter))
	  free(e);

>>>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>>>> +	    map->size * HASHMAP_SHRINK_AT < map->tablesize)
>>>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
>>>
>>> This "we shrink by the same amount" looks inconsistent with the use
>>> of separate grow-at and shrink-at constants (see above for four
>>> suggested #define's).
>>>
>>
>> These values account for a small hysteresis, so that there is no size at which a sequence of add, remove, add, remove (or put, put, put, put) triggers a resize on every operation.
> 
> I was commenting on the two bottom lines of the above three line
> quote from the patch.  You use SHRINK_AT to decide if you want to
> shrink, and you use >>GROW to do the actual shrinking.  Why isn't it
> like this instead?
> 
> 	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
> 	    HASHMAP_SHRINK_AT(map->size) < map->tablesize)
> 		rehash(map, map->tablesize >> HASHMAP_SHRINK);
> 
> The fact that constant used for shrinking was not called SHRINK but
> GROW was what caught my attention.
> 

Renamed to HASHMAP_RESIZE_BITS.


Thread overview: 26+ messages
2013-11-07 14:32 [PATCH v4 00/14] New hash table implementation Karsten Blees
2013-11-07 14:33 ` [PATCH v4 01/14] submodule: don't access the .gitmodules cache entry after removing it Karsten Blees
2013-11-07 22:27   ` Heiko Voigt
2013-11-07 14:34 ` [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal Karsten Blees
2013-11-07 21:40   ` Junio C Hamano
2013-11-08 10:27     ` Karsten Blees
2013-11-08 16:45       ` Philip Oakley
2013-11-08 17:08       ` Junio C Hamano
2013-11-13 16:37         ` Karsten Blees
2013-11-07 14:35 ` [PATCH v4 03/14] buitin/describe.c: use new hash map implementation Karsten Blees
2013-11-07 14:36 ` [PATCH v4 04/14] diffcore-rename.c: move code around to prepare for the next patch Karsten Blees
2013-11-07 14:36 ` [PATCH v4 05/14] diffcore-rename.c: simplify finding exact renames Karsten Blees
2013-11-07 14:37 ` [PATCH v4 06/14] diffcore-rename.c: use new hash map implementation Karsten Blees
2013-11-07 14:38 ` [PATCH v4 07/14] name-hash.c: use new hash map implementation for directories Karsten Blees
2013-11-07 14:38 ` [PATCH v4 08/14] name-hash.c: remove unreferenced directory entries Karsten Blees
2013-11-07 14:39 ` [PATCH v4 09/14] name-hash.c: use new hash map implementation for cache entries Karsten Blees
2013-11-07 14:39 ` [PATCH v4 10/14] name-hash.c: remove cache entries instead of marking them CE_UNHASHED Karsten Blees
2013-11-07 14:40 ` [PATCH v4 11/14] remove old hash.[ch] implementation Karsten Blees
2013-11-07 14:43 ` [PATCH v4 12/14] fix 'git update-index --verbose --again' output Karsten Blees
2013-11-07 22:12   ` Junio C Hamano
2013-11-08 10:27     ` Karsten Blees
2013-11-07 14:44 ` [PATCH v4 13/14] builtin/update-index.c: cleanup update_one Karsten Blees
2013-11-07 21:40   ` Junio C Hamano
2013-11-08 10:27     ` Karsten Blees
2013-11-07 14:45 ` [PATCH v4 14/14] read-cache.c: fix memory leaks caused by removed cache entries Karsten Blees
2013-11-07 21:40   ` Junio C Hamano
