* [PATCH] git exproll: steps to tackle gc aggression
@ 2013-08-06  2:38 Ramkumar Ramachandra
  2013-08-06 12:24 ` Duy Nguyen
  2013-08-07  0:25 ` Martin Fick
  0 siblings, 2 replies; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-06  2:38 UTC (permalink / raw)
  To: Git List; +Cc: Martin Fick

This is the rough explanation I wrote down after reading it:

So, the problem is that my .git/objects/pack is polluted with little
packs every time I fetch (or push, if you're the server), and this is
problematic from the perspective of an overtly (naively) aggressive gc
that hammers out all fragmentation.  So, on the first run, the little
packfiles I have are all "consolidated" into big packfiles; you also
write .keep files to say "don't gc these big packs we just generated".
In subsequent runs, the little packfiles from the fetch are absorbed
into a pack that is then made immune to gc.  You're also using a size
heuristic to consolidate similarly sized packfiles, and you have a
--ratio to tweak the ratio of sizes.

From: Martin Fick<mfick@codeaurora.org>
See: https://gerrit-review.googlesource.com/#/c/35215/
Thread: http://thread.gmane.org/gmane.comp.version-control.git/231555
(Martin's emails are missing from the archive)
---
 Not a candidate for inclusion, given how difficult it is to convince
 our conservative maintainer to add anything new to the tree.

 I'm posting this in the interest of visibility: I've checked it into
 my repo and started using it. I encourage everyone else to do the
 same, so we can learn from it and discuss how to improve gc.
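
 For anyone who wants to try it, here is a rough example invocation;
 the repository path below is a placeholder and the options are the
 ones documented in the usage text of the script:

     # dry run: print the consolidation table without repacking anything
     contrib/exproll/git-exproll.sh -n -v /srv/git/project.git

     # real run with a tighter packfile ratio, passing --quiet through to git gc
     contrib/exproll/git-exproll.sh -v -r 5 --quiet /srv/git/project.git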

 contrib/exproll/git-exproll.sh | 566 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 566 insertions(+)
 create mode 100755 contrib/exproll/git-exproll.sh

diff --git a/contrib/exproll/git-exproll.sh b/contrib/exproll/git-exproll.sh
new file mode 100755
index 0000000..9526d9f
--- /dev/null
+++ b/contrib/exproll/git-exproll.sh
@@ -0,0 +1,566 @@
+#!/bin/bash
+# Copyright (c) 2012, Code Aurora Forum. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#    # Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#    # Redistributions in binary form must reproduce the above
+#       copyright notice, this list of conditions and the following
+#       disclaimer in the documentation and/or other materials provided
+#       with the distribution.
+#    # Neither the name of Code Aurora Forum, Inc. nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED
+# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT
+# ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+# BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+# OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
+# IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+usage() { # error_message
+
+    cat <<-EOF
+		usage: $(basename $0) [-unvt] [--norefs] [--noloose] [-r|--ratio number]
+		                      [git gc option...] git.repo
+
+		-u|-h                usage/help
+		-v verbose
+		-n dry-run           don't actually repack anything
+		-t touch             treat repo as if it had been touched
+		--norefs             avoid extra ref packing timestamp checking
+		--noloose            do not run just because there are loose object dirs
+		                     (repacking may still run if they are referenced)
+		-r ratio <number>    packfile ratio to aim for (default 10)
+
+		git gc option        will be passed as args to git gc
+
+		git.repo             to run gc against
+
+		Garbage collect using a pseudo logarithmic packfile maintenance
+		approach.  This approach attempts to minimize packfile churn
+		by keeping several generations of varying sized packfiles around
+		and only consolidating packfiles (or loose objects) which are
+		either new packfiles, or packfiles close to the same size as
+		another packfile.
+
+		An estimate is used to predict when rollups (one consolidation
+		would cause another consolidation) would occur so that this
+		rollup can be done all at once via a single repack.  This reduces
+		both the runtime and the pack file churn in rollup cases.
+
+		Approach: plan each consolidation by creating a table like this:
+
+		Id Keep Size           Sha1(or consolidation list)      Actions(repack down up note)
+		1     - 11356          9052edfb7392646cd4e5f362b953675985f01f96 y - - New
+		2     - 429088         010904d5c11cd26a79fda91b01ab454d1001b402 y - - New
+		c1    - 440444         [1,2]                                    - - -
+
+		Id:    numbers preceded by a c are estimated "c pack" files
+		Keep:  - none, k private keep, o our keep
+		Size:  in disk blocks (default du output)
+		Sha1:  of packfile, or consolidation list of packfile ids
+		Actions
+		repack: - n no, y yes
+		down:   - noop, ^ consolidate with a file above
+		up:     - noop, v consolidate with a file below
+		note:   Human description of script decisions:
+		         New (file is a new packfile)
+		         Consolidate with:<list of packfile ids>
+		         (too far from:<list of packfile ids>)
+
+		On the first pass, always consolidate any new packfiles along
+		with loose objects and along with any packfiles which are within
+		the ratio size of their predecessors (note, the list is ordered
+		by increasing size).  After each consolidation, insert a fake
+		consolidation, or "c pack", to naively represent the size and
+		ordered positioning of the anticipated new consolidated pack.
+		Every time a new pack is planned, rescan the list in case the
+		new "c pack" would cause more consolidation...
+
+		Once the packfiles which need consolidation are determined, the
+		packfiles which will not be consolidated are marked with a .keep
+		file, and those which will be consolidated will have their .keep
+		removed if they have one.  Thus, the packfiles with a .keep will
+		not get repacked.
+
+		Packfile consolidation is determined by the --ratio parameter
+		(default is 10).  This ratio is somewhat of a tradeoff.  The
+		smaller the number, the more packfiles will be kept on average;
+		this increases disk utilization somewhat.  However, a larger
+		ratio causes greater churn and may increase disk utilization due
+		to deleted packfiles not being reclaimed since they may still be
+		kept open by long running applications such as Gerrit.  Sane
+		ratio values are probably between 2 and 10.  Since most
+		consolidations actually end up smaller than the estimated
+		consolidated packfile size (due to compression), the true ratio
+		achieved will likely be 1 to 2 greater than the target ratio.
+		The smaller the target ratio, the greater this discrepancy.
+
+		Finally, attempt to skip garbage collection entirely on untouched
+		repos.  In order to determine if a repo has been touched, use the
+		timestamp on the script's keep files: if any relevant file/dir
+		is newer than a keep marker file, assume that the repo has been
+		touched and gc needs to run.  Also assume gc needs to run whenever
+		there are loose object dirs since they may contain untouched
+		unreferenced loose objects which need to be pruned (once they
+		expire).
+
+		In order to allow the keep files to be an effective timestamp
+		marker to detect relevant changes in a repo since the last run,
+		all relevant files and directories which may be modified during a
+		gc run (even during a noop gc run) must have their timestamps
+		reset to the same time as the keep files or gc will always run
+		even on untouched repos.  The relevant files/dirs are all those
+		files and directories which garbage collection, object packing,
+		ref packing and pruning might change during noop actions.
+EOF
+
+    [ -n "$1" ] && info "ERROR $1"
+
+    exit
+}
+
+debug() { [ -n "$SW_V" ] && info "$1" ; }
+info() { echo "$1" >&2 ; }
+
+array_copy() { #v2 # array_src array_dst
+    local src=$1 dst=$2
+    local s i=0
+    eval s=\${#$src[@]}
+    while [ $i -lt $s ] ; do
+        eval $dst[$i]=\"\${$src[$i]}\"
+        i=$(($i + 1))
+    done
+}
+
+array_equals() { #v2 # array_name [vals...]
+    local a=$1 ; shift
+    local s=0 t=() val
+    array_copy "$a" t
+    for s in "${!t[@]}" ; do s=$((s+1)) ; done
+    [ "$s" -ne "$#" ] && return 1
+    for val in "${t[@]}" ; do
+        [ "$val" = "$1" ] || return 2
+        shift
+    done
+    return 0
+}
+
+packs_sizes() { # git.repo > "size pack"...
+    du -s "$1"/objects/pack/pack-$SHA1.pack 2> /dev/null | sort -n
+}
+
+is_ourkeep() { grep -q "$KEEP" "$1" 2> /dev/null ; } # keep
+has_ourkeep() { is_ourkeep "$(keep_for "$1")" ; } # pack
+has_keep() { [ -f "$(keep_for "$1")" ] ; } # pack
+is_repo() { [ -d "$1/objects" ] && [ -d "$1/refs/heads" ] ; } # git.repo
+
+keep() { # pack   # returns true if we added our keep
+    keep=$(keep_for "$1")
+    [ -f "$keep" ] && return 1
+    echo "$KEEP" > "$keep"
+    return 0
+}
+
+keep_for() { # packfile > keepfile
+    local keep=$(echo "$1" | sed -es'/\.pack$/.keep/')
+    [ "${keep/.keep}" = "$keep" ] && return 1
+    echo "$keep"
+}
+
+idx_for() { # packfile > idxfile
+    local idx=$(echo "$1" | sed -es'/\.pack$/.idx/')
+    [ "${idx/.idx}" = "$idx" ] && return 1
+    echo "$idx"
+}
+
+# pack_or_keep_file > sha
+sha_for() { echo "$1" | sed -es'|\(.*/\)*pack-\([^.]*\)\..*$|\2|' ; }
+
+private_keeps() { # git.repo -> sets pkeeps
+    local repo=$1 ary=$2
+    local keep keeps=("$repo"/objects/pack/pack-$SHA1.keep)
+    pkeeps=()
+    for keep in "${keeps[@]}" ; do
+        is_ourkeep "$keep" || pkeeps=("${pkeeps[@]}" "$keep")
+    done
+}
+
+is_tooclose() { [ "$(($1 * $RATIO))" -gt "$2" ] ; } # smaller larger
+
+unique() { # [args...] > unique_words
+    local lines=$(while [ $# -gt 0 ] ; do echo "$1" ; shift ; done)
+    lines=$(echo "$lines" | sort -u)
+    echo $lines  # as words
+}
+
+outfs() { # fs [args...] > argfs...
+    local fs=$1 ; shift
+    [ $# -gt 0 ] && echo -n "$1" ; shift
+    while [ $# -gt 0 ] ; do echo -n "$fs$1" ; shift ; done
+}
+
+sort_list() { # < list > formatted_list
+    # n has_keep size sha repack down up note
+    awk '{ note=$8; for(i=8;i<NF;i++) note=note " "$(i+1)
+           printf("%-5s %s %-14s %-40s %s %s %s %s\n", \
+                     $1,$2,   $3,  $4, $5,$6,$7,note)}' |\
+        sort -k 3,3n -k 1,1n
+}
+
+is_touched() { # git.repo
+    local repo=$1
+    local loose keep ours newer
+    [ -n "$SW_T" ] && { debug "$SW_T -> treat as touched" ; return 0 ; }
+
+    if [ -z "$SW_LOOSE" ] ; then
+        # If there are loose objects, they may need to be pruned,
+        # run even if nothing has really been touched.
+        loose=$(find "$repo/objects" -type d \
+                      -wholename "$repo/objects/[0-9][0-9]" \
+                      -print -quit 2>/dev/null)
+        [ -n "$loose" ] && { info "There are loose object directories" ; return 0 ; }
+    fi
+
+    # If we don't have a keep, the current packfiles may not have been
+    # compressed with the current gc policy (gc may never have been run),
+    # so run at least once to repack everything.  Also, we need a marker
+    # file for timestamp tracking (a dir needs to detect changes within
+    # it, so it cannot be a marker) and our keeps are something we control,
+    # use them.
+    for keep in "$repo"/objects/pack/pack-$SHA1.keep ; do
+        is_ourkeep "$keep" && { ours=$keep ; break ; }
+    done
+    [ -z "$ours" ] && { info 'We have no keep (we have never run?): run' ; return 0 ; }
+
+    debug "Our timestamp keep: $ours"
+    # The wholename stuff seems to get touched by a noop git gc
+    newer=$(find "$repo/objects" "$repo/refs" "$repo/packed-refs" \
+                  '!' -wholename "$repo/objects/info" \
+                  '!' -wholename "$repo/objects/info/*" \
+                  -newer "$ours" \
+                  -print -quit 2>/dev/null)
+    [ -z "$newer" ] && return 1
+
+    info "Touched since last run: $newer"
+    return 0
+}
+
+touch_refs() { # git.repo start_date refs
+    local repo=$1 start_date=$2 refs=$3
+    (
+        debug "Setting start date($start_date) on unpacked refs:"
+        debug "$refs"
+        cd "$repo/refs" || return
+        # safe to assume no newlines in a ref name
+        echo "$refs" | xargs -d '\n' -n 1 touch -c -d "$start_date"
+    )
+}
+
+set_start_date() { # git.repo start_date refs refdirs packedrefs [packs]
+    local repo=$1 start_date=$2 refs=$3 refdirs=$4 packedrefs=$5 ; shift 5
+    local pack keep idx repacked
+
+    # This stuff is touched during object packs
+    while [ $# -gt 0 ] ; do
+        pack=$1 ; shift
+        keep="$(keep_for "$pack")"
+        idx="$(idx_for "$pack")"
+        touch -c -d "$start_date" "$pack" "$keep" "$idx"
+        debug "Setting start date on: $pack $keep $idx"
+    done
+    # This will prevent us from detecting any deletes in the pack dir
+    # since gc ran, except for private keeps which we are checking
+    # manually.  But there really shouldn't be any other relevant deletes
+    # in this dir which should cause us to rerun next time, deleting a
+    # pack or index file by anything but gc would be bad!
+    debug "Setting start date on pack dir: $start_date"
+    touch -c -d "$start_date" "$repo/objects/pack"
+
+
+    if [ -z "$SW_REFS" ] ; then
+        repacked=$(find "$repo/packed-refs" -newer "$repo/objects/pack" \
+                      -print -quit 2>/dev/null)
+        if [ -n "$repacked" ] ; then
+            # The ref dirs and packed-ref files seem to get touched even on
+            # a noop refpacking
+            debug "Setting start date on packed-refs"
+            touch -c -d "$start_date" "$repo/packed-refs"
+            touch_refs "$repo" "$start_date" "$refdirs"
+
+            # A ref repack does not imply a ref change, but since it is
+            # hard to tell, simply assume so
+            if [ "$refs" != "$(cd "$repo/refs" ; find -depth)" ] || \
+               [ "$packedrefs" != "$(<"$repo/packed-refs")" ] ; then
+                # We retouch if needed (instead of simply checking then
+                # touching) to avoid a race between the check and the set.
+                debug "  but refs actually got packed, so retouch packed-refs"
+                touch -c "$repo/packed-refs"
+            fi
+        fi
+    fi
+}
+
+note_consolidate() { # note entry > note (no duplicated consolidated entries)
+    local note=$1 entry=$2
+    local entries=() ifs=$IFS
+    if  echo "$note" | grep -q 'Consolidate with:[0-9,c]' ; then
+        IFS=,
+        entries=( $(echo "$note" | sed -es'/^.*Consolidate with:\([0-9,c]*\).*$/\1/') )
+        note=( $(echo "$note" | sed -es'/Consolidate with:[0-9,c]*//') )
+        IFS=$ifs
+    fi
+    entries=( $(unique "${entries[@]}" "$entry") )
+    echo "$note Consolidate with:$(outfs , "${entries[@]}")"
+}
+
+note_toofar() { # note entry > note (no duplicated "too far" entries)
+    local note=$1 entry=$2
+    local entries=() ifs=$IFS
+    if  echo "$note" | grep -q '(too far from:[0-9,c]*)' ; then
+        IFS=,
+        entries=( $(echo "$note" | sed -es'/^.*(too far from:\([0-9,c]*\)).*$/\1/') )
+        note=( $(echo "$note" | sed -es'/(too far from:[0-9,c]*)//') )
+        IFS=$ifs
+    fi
+    entries=( $(unique "${entries[@]}" "$entry") )
+    echo "$note (too far from:$(outfs , "${entries[@]}"))"
+}
+
+last_entry() { # isRepack pline repackline > last_rows_entry
+    local size_hit=$1 pline=$2 repackline=$3
+    if [ -n "$pline" ] ; then
+        if [ -n "$size_hit" ] ; then
+            echo "$repackline"
+        else
+            echo "$pline"
+        fi
+    fi
+}
+
+init_list() { # git.repo > shortlist
+    local repo=$1
+    local file
+    local n has_keep size sha repack
+
+    packs_sizes "$1" | {
+        while read size file ; do
+            n=$((n+1))
+            repack=n
+            has_keep=-
+            if has_keep "$file" ; then
+                has_keep=k
+                has_ourkeep "$file" && has_keep=o
+            fi
+            sha=$(sha_for "$file")
+            echo "$n $has_keep $size $sha $repack"
+        done
+    } | sort_list
+}
+
+consolidate_list() { # run < list > list
+    local run=$1
+    local sum=0 psize=0 sum_size=0 size_hit pn clist pline repack_line
+    local n has_keep size sha repack down up note
+
+    {
+        while read n has_keep size sha repack down up note; do
+            [ -z "$up" ] && up='-'
+            [ -z "$down" ] && down="-"
+
+            if [ "$has_keep" = "k" ] ; then
+                echo "$n $has_keep $size $sha $repack - - Private"
+                continue
+            fi
+
+            if [ "$repack" = "n" ] ; then
+                if is_tooclose $psize $size ; then
+                    size_hit=y
+                    repack=y
+                    sum=$(($sum + $sum_size + $size))
+                    sum_size=0 # Prevents double summing this entry
+                    clist=($(unique "${clist[@]}" $pn $n))
+                    down="^"
+                    [ "$has_keep" = "-" ] && note="$note New +"
+                    note=$(note_consolidate "$note" "$pn")
+                elif [ "$has_keep" = "-" ] ; then
+                    repack=y
+                    sum=$(($sum + $size))
+                    sum_size=0 # Prevents double summing this entry
+                    clist=($(unique "${clist[@]}" $n))
+                    note="$note New"
+                elif [ $psize -ne 0 ] ; then
+                    sum_size=$size
+                    down="!"
+                    note=$(note_toofar "$note" "$pn")
+                else
+                    sum_size=$size
+                fi
+            else
+                sum_size=$size
+            fi
+
+            # By preventing "c files" (consolidated) from being marked
+            # "repack" they won't get keeps
+            repack2=y
+            [ "${n/c}" != "$n" ] && { repack=- ; repack2=- ; }
+
+            last_entry "$size_hit" "$pline" "$repack_line"
+            # Delay the printout until we know whether we are
+            # being consolidated with the entry following us
+            # (we won't know until the next iteration).
+            # size_hit is used to determine which of the lines
+            # below will actually get printed above on the next
+            # iteration.
+            pline="$n $has_keep $size $sha $repack $down $up $note"
+            repack_line="$n $has_keep $size $sha $repack2 $down v $note"
+
+            pn=$n ; psize=$size # previous entry data
+            size_hit='' # will not be consolidated up
+
+        done
+        last_entry "$size_hit" "$pline" "$repack_line"
+
+        [ $sum -gt 0 ] && echo "c$run - $sum [$(outfs , "${clist[@]}")] - - -"
+
+    } | sort_list
+}
+
+process_list() { # git.repo > list
+    local list=$(init_list "$1")  plist run=0
+
+    while true ; do
+        plist=$list
+        run=$((run +1))
+        list=$(echo "$list" | consolidate_list "$run")
+        if [ "$plist" != "$list" ] ; then
+            debug "------------------------------------------------------------------------------------"
+            debug "$HEADER"
+            debug "$list"
+        else
+            break
+        fi
+    done
+    debug "------------------------------------------------------------------------------------"
+    echo "$list"
+}
+
+repack_list() { # git.repo < list
+    local repo=$1
+    local start_date newpacks=0 pkeeps keeps=1 refs refdirs rtn
+    local packedrefs=$(<"$repo/packed-refs")
+
+    # so they don't appear touched after a noop refpacking
+    if [ -z "$SW_REFS" ] ; then
+        refs=$(cd "$repo/refs" ; find -depth)
+        refdirs=$(cd "$repo/refs" ; find -type d -depth)
+        debug "Before refs:"
+        debug "$refs"
+    fi
+
+    # Find a private keep snapshot which has not changed from
+    # before our start_date so private keep deletions during gc
+    # can be detected
+    while ! array_equals pkeeps "${keeps[@]}" ; do
+       debug "Getting a private keep snapshot"
+       private_keeps "$repo"
+       keeps=("${pkeeps[@]}")
+       debug "before keeps: ${keeps[*]}"
+       start_date=$(date)
+       private_keeps "$repo"
+       debug "after keeps: ${pkeeps[*]}"
+    done
+
+    while read n has_keep size sha repack down up note; do
+        if [ "$repack" = "y" ] ; then
+            keep="$repo/objects/pack/pack-$sha.keep"
+            info "Repacking $repo/objects/pack/pack-$sha.pack"
+            [ -f "$keep" ] && rm -f "$keep"
+        fi
+    done
+
+    ( cd "$repo" && git gc "${GC_OPTS[@]}" ) ; rtn=$?
+
+    # Mark any packfiles without a .keep with our .keep
+    packs=("$repo"/objects/pack/pack-$SHA1.pack)
+    for pack in "${packs[@]}" ; do
+        if keep "$pack" ; then
+            info "New pack: $pack"
+            newpacks=$((newpacks+1))
+        fi
+    done
+
+    # Record start_time.  If there is more than 1 new packfile, we
+    # don't want to risk touching it with an older date since that
+    # would prevent consolidation on the next run.  If the private
+    # keeps have changed, then we should run next time no matter what.
+    if [ $newpacks -le 1 ] || ! array_equals pkeeps "${keeps[@]}" ; then
+        set_start_date "$repo" "$start_date" "$refs" "$refdirs" "$packedrefs" "${packs[@]}"
+    fi
+
+    return $rtn # we really only care about the gc error code
+}
+
+git_gc() { # git.repo
+    local list=$(process_list "$1")
+    if [ -z "$SW_V" ] ; then
+        info "Running $PROG on $1.  git gc options: ${GC_OPTS[@]}"
+        echo "$HEADER" >&2
+        echo "$list" >&2 ;
+    fi
+    echo "$list" | repack_list "$1"
+}
+
+
+PROG=$(basename "$0")
+HEADER="Id Keep Size           Sha1(or consolidation list)      Actions(repack down up note)"
+KEEP=git-exproll
+HEX='[0-9a-f]'
+HEX10=$HEX$HEX$HEX$HEX$HEX$HEX$HEX$HEX$HEX$HEX
+SHA1=$HEX10$HEX10$HEX10$HEX10
+
+RATIO=10
+SW_N='' ; SW_V='' ; SW_T='' ; SW_REFS='' ; SW_LOOSE='' ; GC_OPTS=()
+while [ $# -gt 0 ] ; do
+    case "$1" in
+        -u|-h)  usage ;;
+        -n)  SW_N="$1" ;;
+        -v)  SW_V="$1" ;;
+
+        -t)  SW_T="$1" ;;
+        --norefs)  SW_REFS="$1" ;;
+        --noloose) SW_LOOSE="$1" ;;
+
+        -r|--ratio)  shift ; RATIO="$1" ;;
+
+        *)  [ $# -le 1 ] && break
+            GC_OPTS=( "${GC_OPTS[@]}" "$1" )
+            ;;
+    esac
+    shift
+done
+
+
+REPO="$1"
+if ! is_repo "$REPO" ; then
+    REPO=$REPO/.git
+    is_repo "$REPO" || usage "($1) is not likely a git repo"
+fi
+
+
+if [ -z "$SW_N" ] ; then
+    is_touched "$REPO" || { info "Repo untouched since last run" ; exit ; }
+    git_gc "$REPO"
+else
+    is_touched "$REPO" || info "Repo untouched since last run, analyze anyway."
+    process_list "$REPO" >&2
+fi
-- 
1.8.4.rc1.4.gd6cbf2f


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-06  2:38 [PATCH] git exproll: steps to tackle gc aggression Ramkumar Ramachandra
@ 2013-08-06 12:24 ` Duy Nguyen
  2013-08-06 17:39   ` Junio C Hamano
  2013-08-07  0:10   ` Martin Fick
  2013-08-07  0:25 ` Martin Fick
  1 sibling, 2 replies; 33+ messages in thread
From: Duy Nguyen @ 2013-08-06 12:24 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git List, Ramkumar Ramachandra

On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
> +               Garbage collect using a pseudo logarithmic packfile maintenance
> +               approach.  This approach attempts to minimize packfile churn
> +               by keeping several generations of varying sized packfiles around
> +               and only consolidating packfiles (or loose objects) which are
> +               either new packfiles, or packfiles close to the same size as
> +               another packfile.

I wonder if a simpler approach may be nearly as efficient as this one:
keep the largest pack out, repack the rest at fetch/push time so there
are at most 2 packs at a time. Or we could do the repack at 'gc
--auto' time, but with a lower pack threshold (about 10 or so). When
the second pack is as big as, say, half the size of the first, merge
them into one at "gc --auto" time. This can be easily implemented in
git-repack.sh.
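
A rough sketch of that idea (untested; assumes a bare repo layout and
that ls sorts by size with -S):

    # keep the largest pack out, collapse everything else into one new pack
    cd /srv/git/project.git                       # placeholder path
    largest=$(ls -S objects/pack/pack-*.pack | head -n 1)
    touch "${largest%.pack}.keep"
    git repack -a -d    # honors the .keep, so at most 2 packs remain
    # drop the .keep once the second pack grows big enough to merge them
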
-- 
Duy


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-06 12:24 ` Duy Nguyen
@ 2013-08-06 17:39   ` Junio C Hamano
  2013-08-07  4:43     ` Ramkumar Ramachandra
  2013-08-07  0:10   ` Martin Fick
  1 sibling, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2013-08-06 17:39 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Martin Fick, Git List, Ramkumar Ramachandra

Duy Nguyen <pclouds@gmail.com> writes:

> On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
>> +               Garbage collect using a pseudo logarithmic packfile maintenance
>> +               approach.  This approach attempts to minimize packfile churn
>> +               by keeping several generations of varying sized packfiles around
>> +               and only consolidating packfiles (or loose objects) which are
>> +               either new packfiles, or packfiles close to the same size as
>> +               another packfile.
>
> I wonder if a simpler approach may be nearly efficient as this one:
> keep the largest pack out, repack the rest at fetch/push time so there
> are at most 2 packs at a time. Or we we could do the repack at 'gc
> --auto' time, but with lower pack threshold (about 10 or so). When the
> second pack is as big as, say half the size of the first, merge them
> into one at "gc --auto" time. This can be easily implemented in
> git-repack.sh.

Another random thought.

Imagine we have a cheap way to enumerate the young objects without
the usual history traversal.  For example, a list of all loose objects
and what appears in the .idx files that are young.

We can reconstruct "names" for trees and blobs from such a list of
object names; if a commit in the list refers to a tree, that tree is
the top level, and a blob or a tree that appears in such a top-level
tree can be given a "name" for its place in the tree (recursively).
I suspect we would end up giving names to a large majority of trees
and blobs in such a list by doing so.

If these suspicions turn out to be true, then we could:

 - run that enumeration algorithm to come up with a set of object
   names;

 - emit the tag objects in that set in the tagger timestamp order;

 - emit the commit objects in that set in the commit timestamp
   order, while noting the tree objects contained in the set, giving
   them name "";

 - "traverse" the trees and blobs in that set, giving the found ones
   names (do so without stepping outside the set);

 - emit the trees and blobs with their names.  Some objects may not
   have been given any name, but that is OK as long as they are in the
   minority.

We then feed that to pack-objects to produce a single pack, and
prune away the source of these young objects in the end.

The above could turn out to be much cheaper than the traditional
history traversal.
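
For the loose-object half of that enumeration, a minimal sketch
(ignoring the naming/ordering steps above, and assuming a standard
.git layout) could be as simple as:

    # list loose object names without any history traversal and feed
    # them to pack-objects; the result may well contain unreachable cruft
    find .git/objects/[0-9a-f][0-9a-f] -type f |
    sed -e 's|.*objects/||' -e 's|/||' |
    git pack-objects .git/objects/pack/pack

Object names taken from the small-and-young .idx files would be
appended to the same input.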


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-06 12:24 ` Duy Nguyen
  2013-08-06 17:39   ` Junio C Hamano
@ 2013-08-07  0:10   ` Martin Fick
  2013-08-08  2:18     ` Duy Nguyen
  1 sibling, 1 reply; 33+ messages in thread
From: Martin Fick @ 2013-08-07  0:10 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git List, Ramkumar Ramachandra

On Tuesday, August 06, 2013 06:24:50 am Duy Nguyen wrote:
> On Tue, Aug 6, 2013 at 9:38 AM, Ramkumar Ramachandra 
<artagnon@gmail.com> wrote:
> > +               Garbage collect using a pseudo
> > logarithmic packfile maintenance +              
> > approach.  This approach attempts to minimize packfile
> > churn +               by keeping several generations
> > of varying sized packfiles around +               and
> > only consolidating packfiles (or loose objects) which
> > are +               either new packfiles, or packfiles
> > close to the same size as +               another
> > packfile.
> 
> I wonder if a simpler approach may be nearly efficient as
> this one: keep the largest pack out, repack the rest at
> fetch/push time so there are at most 2 packs at a time.
> Or we we could do the repack at 'gc --auto' time, but
> with lower pack threshold (about 10 or so). When the
> second pack is as big as, say half the size of the
> first, merge them into one at "gc --auto" time. This can
> be easily implemented in git-repack.sh.

It would definitely be better than the current gc approach.  

However, I suspect it is still at least one to two orders of 
magnitude off from where it should be.  To give you a real 
world example, on our server today when gitexproll ran on 
our kernel/msm repo, it consolidated 317 pack files into one 
almost 8M packfile (it compresses/dedupes shockingly well, 
one of those new packs was 33M).  Our largest packfile in 
that repo is 1.5G!  

So let's now imagine that the second closest packfile is 
only 100M: it would keep getting consolidated with 8M worth 
of data every day (assuming the same conditions and no extra 
compression).  That would take (750M-100M)/8M ~ 81 days 
(750M being half the 1.5G pack, the merge threshold proposed 
above) to finally build up large enough to no longer 
consolidate the new packs with the second largest pack file 
daily.  During those 80+ days, it will be on average writing 
325M too much per day (when it should be writing just 8M).

So I can see the appeal of a simple solution; unfortunately, 
I think one layer would still "suck".  And if you are 
going to add even just one extra layer, I suspect that you 
might as well go the full distance since you probably 
already need to implement the logic to do so?

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
 


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-06  2:38 [PATCH] git exproll: steps to tackle gc aggression Ramkumar Ramachandra
  2013-08-06 12:24 ` Duy Nguyen
@ 2013-08-07  0:25 ` Martin Fick
  2013-08-07  4:36   ` Ramkumar Ramachandra
  1 sibling, 1 reply; 33+ messages in thread
From: Martin Fick @ 2013-08-07  0:25 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Git List

On Monday, August 05, 2013 08:38:47 pm Ramkumar Ramachandra 
wrote:
> This is the rough explanation I wrote down after reading
> it:
> 
> So, the problem is that my .git/objects/pack is polluted
> with little packs everytime I fetch (or push, if you're
> the server), and this is problematic from the
> perspective of a overtly (naively) aggressive gc that
> hammers out all fragmentation.  So, on the first run,
> the little packfiles I have are all "consolidated" into
> big packfiles; you also write .keep files to say that
> "don't gc these big packs we just generated".  In
> subsequent runs, the little packfiles from the fetch are
> absorbed into a pack that is immune to gc.  You're also
> using a size heuristic, to consolidate similarly sized
> packfiles.  You also have a --ratio to tweak the ratio
> of sizes.
> 
> From: Martin Fick<mfick@codeaurora.org>
> See: https://gerrit-review.googlesource.com/#/c/35215/
> Thread:
> http://thread.gmane.org/gmane.comp.version-control.git/2
> 31555 (Martin's emails are missing from the archive)
> ---

After analyzing today's data, I recognize that in some 
circumstances the size estimation after consolidation can be 
off by huge amounts.  The script naively just adds the 
current sizes together.  This gives a very rough estimate 
of the new packfile size, but sometimes it can be off by 
over 2 orders of magnitude. :(  While many new packfiles are 
tiny (several K only), it seems like the larger new 
packfiles have a terrible tendency to throw the estimate way 
off (I suspect they simply have many duplicate objects).  
But despite this poor estimate, the script still offers 
drastic improvements over plain git gc.
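
For what it is worth, the mismatch is easy to measure by 
hand (the repo path below is a placeholder):

    repo=/srv/git/project.git
    du -ks "$repo"/objects/pack/pack-*.pack |
        awk '{s+=$1} END {print s "K naive sum of pack sizes"}'
    (cd "$repo" && git gc)    # or a single git repack -a -d
    du -ks "$repo"/objects/pack/pack-*.pack   # actual result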

So, it has me wondering if there isn't a more accurate way 
to estimate the new packfile without wasting a ton of time?

If not, one approach which might be worth experimenting with 
is to just assume that new packfiles have size 0!  Then just 
consolidate them with any other packfile which is ready for 
consolidation, or if none are ready, with the smallest 
packfile.  I would not be surprised to see this work on 
average better than the current summation.

-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
 


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-07  0:25 ` Martin Fick
@ 2013-08-07  4:36   ` Ramkumar Ramachandra
  0 siblings, 0 replies; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-07  4:36 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git List

Martin Fick wrote:
> So, it has me wondering if there isn't a more accurate way
> to estimate the new packfile without wasting a ton of time?

I'm not sure there is. Adding the sizes of individual packs can be off
by a lot, because your deltification will be more effective if you
have more data to slide windows over and compress. For the purposes of
illustration, take a simple example:

packfile-1 has a 30M Makefile and several tiny deltas. Total = 40M.
packfile-2 has a 31M Makefile.um and several tiny deltas. Total = 40M.

Now, what is the size of packfile-3 which contains the contents of
both packfile-1 and packfile-2? 80M is a bad estimate, because you can
store deltas against just one Makefile.

So, unless you do an in-depth analysis of the objects in the packfiles
(which can be terribly expensive), I don't see how you can arrive at a
better estimate.
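
A crude way to see the effect on any real repository (a rough
experiment, not a proposal):

    # split one object list across two packs, then pack it as a whole
    git rev-list --objects HEAD >/tmp/objs
    head -n 1000 /tmp/objs  | git pack-objects /tmp/half-a >/dev/null
    tail -n +1001 /tmp/objs | git pack-objects /tmp/half-b >/dev/null
    git pack-objects /tmp/whole </tmp/objs >/dev/null
    du -k /tmp/half-*.pack /tmp/whole-*.pack

The whole-* pack usually comes out noticeably smaller than the two
halves combined, because deltas can reach across the artificial split.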

> If not, one approach which might be worth experimenting with
> is to just assume that new packfiles have size 0!  Then just
> consolidate them with any other packfile which is ready for
> consolidation, or if none are ready, with the smallest
> packfile.  I would not be surprised to see this work on
> average better than the current summation,

That is assuming that all fetches (or pushes) are small, which is
probably a good rule; you might like to have a "smallness threshold",
although I haven't thought hard about the problem.


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-06 17:39   ` Junio C Hamano
@ 2013-08-07  4:43     ` Ramkumar Ramachandra
  2013-08-08  7:13       ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-07  4:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Martin Fick, Git List

Junio C Hamano wrote:
> Imagine we have a cheap way to enumerate the young objects without
> the usual history traversal.

Before we discuss the advantages, can you outline how we can possibly
get this data without actually walking downwards from the roots
(refs)? One way to do it is to pull data out of a log of ref updates
(aka. reflog), but we both know how unreliable that can be.


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-07  0:10   ` Martin Fick
@ 2013-08-08  2:18     ` Duy Nguyen
  0 siblings, 0 replies; 33+ messages in thread
From: Duy Nguyen @ 2013-08-08  2:18 UTC (permalink / raw)
  To: Martin Fick; +Cc: Git List, Ramkumar Ramachandra

On Wed, Aug 7, 2013 at 7:10 AM, Martin Fick <mfick@codeaurora.org> wrote:
>> I wonder if a simpler approach may be nearly efficient as
>> this one: keep the largest pack out, repack the rest at
>> fetch/push time so there are at most 2 packs at a time.
>> Or we we could do the repack at 'gc --auto' time, but
>> with lower pack threshold (about 10 or so). When the
>> second pack is as big as, say half the size of the
>> first, merge them into one at "gc --auto" time. This can
>> be easily implemented in git-repack.sh.
>
> It would definitely be better than the current gc approach.
>
> However, I suspect it is still at least one to two orders of
> magnitude off from where it should be.  To give you a real
> world example, on our server today when gitexproll ran on
> our kernel/msm repo, it consolidated 317 pack files into one
> almost 8M packfile (it compresses/dedupes shockingly well,
> one of those new packs was 33M).  Our largest packfile in
> that repo is 1.5G!
>
> So let's now imagine that the second closest packfile is
> only 100M, it would keep getting consolidated with 8M worth
> of data every day (assuming the same conditions and no extra
> compression).  That would take (750M-100M)/8M ~ 81 days to
> finally build up large enough to no longer consolidate the
> new packs with the second largest pack file daily.  During
> those 80+ days, it will be on average writing 325M too much
> per day (when it should be writing just 8M).
>
> So I can see the appeal of a simple solution, unfortunately
> I think one layer would still "suck" though.  And if you are
> going to add even just one extra layer, I suspect that you
> might as well go the full distance since you probably
> already need to implement the logic to do so?

I see. It looks like your way is the best way to go.
-- 
Duy


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-07  4:43     ` Ramkumar Ramachandra
@ 2013-08-08  7:13       ` Junio C Hamano
  2013-08-08  7:44         ` Ramkumar Ramachandra
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2013-08-08  7:13 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Duy Nguyen, Martin Fick, Git List

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> Junio C Hamano wrote:
>> Imagine we have a cheap way to enumerate the young objects without
>> the usual history traversal.
>
> Before we discuss the advantages, can you outline how we can possibly
> get this data without actually walking downwards from the roots
> (refs)? One way to do it is to pull data out of a log of ref updates
> (aka. reflog), but we both know how unreliable that can be.

My understanding of the topic is that we want to come up with a way
that is much cheaper than the current "gc --auto", which involves a
recent-history walk to consolidate both loose objects and small young
packs into one, so that we can use that logic for "gc --auto".

The key phrase is "without the usual history traversal".  We are
talking about young objects, and they are likely to be reachable
from something (like reflog entries, if not refs).  We may include
unreachable cruft in the result of this "let's be quick and collect
them into a single young pack" step; you will need to keep them
while reflog entries are alive, and you will need periodic sweeps
with the usual history walking to remove older cruft that has
recently become unreachable due to reflog expiry from packs anyway.
So it is not a problem for the pack that consolidates young objects
into a single pack to contain some unreachable cruft.

If you start from that assumption [*1*], the way to enumerate the
young objects without the usual history traversal should be fairly
obvious.

By definition, loose objects are all young because they were created
since the last "gc --auto".  Also, pack .idx files know their own
creation timestamp to let you decide how old they are, and you can
see how many objects there are in the corresponding .pack and how big
it is.

By doing an equivalent of "find .git/objects/[0-9a-f][0-9a-f]/", you
can enumerate the loose objects, and an equivalent of "show-index"
will enumerate the objects in any pack whose .idx file you have
determined to be small and young.
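
Concretely (the pack name is a placeholder, and "show-index" is the
plumbing that dumps an .idx):

    idx=.git/objects/pack/pack-0123abcd.idx
    ls -l "$idx"                      # mtime says how old the pack is
    git show-index <"$idx" | wc -l    # how many objects the pack holds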

Note that *1* is an assumption. I do not know offhand if such a
"consolidate young objects quickly into one to keep the number of
packs small" strategy is an overall win.


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08  7:13       ` Junio C Hamano
@ 2013-08-08  7:44         ` Ramkumar Ramachandra
  2013-08-08 16:56           ` Junio C Hamano
  0 siblings, 1 reply; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-08  7:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Martin Fick, Git List

Junio C Hamano wrote:
> it is
> not a problem for the pack that consolidates young objects into a
> single pack to contain some unreachable crufts.

So far, we have never considered putting unreachable objects in packs.
Let me ask the obvious question first: what happens when I push? Do I
pack up all the loose objects quickly (without bothering about
reachability) and send unreachable cruft to the server? Who is
ultimately going to clean up this cruft, and when?

> Note that *1* is an assumption. I do not know offhand if such a
> "consolidate young objects quickly into one to keep the number of
> packs small" strategy is an overall win.

The way I see it, you're just delaying the reachability analysis
beyond the usual pack-time. The whole point of separating out loose
objects from packfiles was to not make the user wait every time she
does common repository operations locally: delay the reachability
analysis and compression (aka. packing) until a network operation
kicks in.

To see if introducing another stage is an overall win, think about
what the bottlenecks are: in network operations, it's always the data
being sent over. If I understand correctly, during a network
transaction, there's very minimal computation where the server-client
just quickly tell each other where their refs are: therefore, it's
trivial to figure out what's missing and pack that up to send it over.
The strategy of including unreachable cruft in these packs might make
sense from the point of view of a local gc, but will ultimately slow
down network operations, no?


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08  7:44         ` Ramkumar Ramachandra
@ 2013-08-08 16:56           ` Junio C Hamano
  2013-08-08 17:34             ` Martin Fick
  2013-08-08 17:36             ` Ramkumar Ramachandra
  0 siblings, 2 replies; 33+ messages in thread
From: Junio C Hamano @ 2013-08-08 16:56 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Duy Nguyen, Martin Fick, Git List

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> Junio C Hamano wrote:
>> it is
>> not a problem for the pack that consolidates young objects into a
>> single pack to contain some unreachable crufts.
>
> So far, we have never considered putting unreachable objects in packs.
> Let me ask the obvious question first: what happens when I push? Do I
> pack up all the loose objects quickly (without bothering about
> reachability) and send unreachable cruft to the server?

No.

I thought the discussion was about making the local gc cheaper, and
the "Imagine we have a cheap way" was to address it by assuming that
the daily "pack young objects into a single pack" can be sped up if
we did not have to traverse history.  More permanent packs (the
older ones in the "set of packs staggered by age" Martin proposes) in
the repository should go through the normal history traversal route.

And of course we do not transfer objects that are not asked for from
or to a repository over pack transfer.

Most importantly, it is not about butchering the pack machinery in
such a way that we can create _only_ such "non history traversal"
packs.

So I do not see how that question is "obvious".  The question is
obviously pointless and misses the mark by a wide margin?  The
question makes it obvious that whoever asks it does not understand
how Git works?


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 16:56           ` Junio C Hamano
@ 2013-08-08 17:34             ` Martin Fick
  2013-08-08 18:52               ` Junio C Hamano
  2013-08-08 17:36             ` Ramkumar Ramachandra
  1 sibling, 1 reply; 33+ messages in thread
From: Martin Fick @ 2013-08-08 17:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ramkumar Ramachandra, Duy Nguyen, Git List

On Thursday, August 08, 2013 10:56:38 am Junio C Hamano 
wrote:
> I thought the discussion was about making the local gc
> cheaper, and the "Imagine we have a cheap way" was to
> address it by assuming that the daily "pack young
> objects into a single pack" can be sped up if we did not
> have to traverse history.  More permanent packs (the
> older ones in "set of packs staggered by age" Martin
> proposes) in the repository should go through the normal
> history traversal route.

Assuming I understand what you are suggesting, would these 
"young object" likely still get "deduped" in an efficient 
way without doing history traversal (it sounds like they 
would)?  In other words, if I understand correctly, it would 
save time by not pruning unreferenced objects, but it would 
still be deduping things and delta compressing also, so you 
would still likely get a great benefit from creating these 
young object packs?  In other words, is there still a good 
chance that my 317 new pack files which included a 33M pack 
file will still get consolidated down to something near 8M?  

If so, then yeah this might be nice, especially if skipping 
the history traversal is what would speed this up.  Because 
today, my solution mostly saves IO and not time.  I think it 
still saves time (I believe I have seen up to a 50% savings), 
but that is nothing compared to the massive, several orders 
of magnitude IO savings.  But if what you suggest could also 
give massive time (orders of magnitude) savings along with 
the IO improvements I am seeing, then suddenly repacking 
regularly would become very cheap even on large repos.  

The only time-consuming piece would be pruning then?  Could 
bitmaps eventually help out there?

-Martin


-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
 


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 16:56           ` Junio C Hamano
  2013-08-08 17:34             ` Martin Fick
@ 2013-08-08 17:36             ` Ramkumar Ramachandra
  2013-08-08 19:37               ` Junio C Hamano
  1 sibling, 1 reply; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-08 17:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Martin Fick, Git List

Junio C Hamano wrote:
> So I do not see how that question is "obvious".  The question
> obviously pointless and misses the mark by wide margin?  The
> question makes it obvious that whoever asks it does not understand
> how Git works?

Shall we all sit and mourn over the fact that I don't understand how
Git works, or are you willing to explain it to me?

> And of course we do not transfer objects that are not asked for from
> or to a repository over pack tranfer.
>
> Most importantly, it is not about butchering the pack machinery in
> such a way that we can create _only_ such "non history traversal"
> packs.

I asked you a very simple question: what happens when I do "git push"?
Instead of answering the question, you butchered the pack machinery to
"only" create packs with garbage in them (aka. stripped out the
reachability analysis code completely), and blamed me for doing it.

Explain it to me in plain English without getting agitated:

1. I'm on my terminal doing various repository operations: constantly
creating new objects and moving my refs around to create unreachable
objects. I have lots of loose objects.

2. I say "git push". What happens? A reachability analysis is
performed on my loose objects, and the ones reachable by the ref I'm
sending are packed up and sent over the network. Now, I no longer have
any loose objects.

3. After a few days of working, the gc heuristics figure out that I
have too much garbage and too many packs; a cleanup is required. The
gc --auto doesn't tolerate fragmentation: it tries to put everything
into one large pack.

Loop.

We're talking about tackling the gc aggression problem in 3. And you
propose putting the young objects in a pack without performing
reachability analysis: I'm asking how this is going to benefit me;
when I say "git push" (or when gc decides to repack), won't I need to
explode the young pack into loose objects, do a reachability analysis,
and repack anyway?


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 17:34             ` Martin Fick
@ 2013-08-08 18:52               ` Junio C Hamano
  2013-08-08 19:14                 ` Ramkumar Ramachandra
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2013-08-08 18:52 UTC (permalink / raw)
  To: Martin Fick; +Cc: Ramkumar Ramachandra, Duy Nguyen, Git List

Martin Fick <mfick@codeaurora.org> writes:

> Assuming I understand what you are suggesting, would these 
> "young object" likely still get "deduped" in an efficient 
> way without doing history traversal (it sounds like they 
> would)?

Yes.

The very first thing pack-object machinery does is to get the list
of object names and sort them in a certain order to help produce
good deltas, and this initial input preprocessing will dedup them.

> If so, then yeah this might be nice, especially if the history
> traversal is what would speed this up.

That was the assumption behind the "it might help" suggestion.
Whether that helps or not is not known yet, and since Ram started this
subthread telling me not to talk about performance improvements, my
time on this thread is _not_ spent on that (yet).


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 18:52               ` Junio C Hamano
@ 2013-08-08 19:14                 ` Ramkumar Ramachandra
  0 siblings, 0 replies; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-08 19:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Martin Fick, Duy Nguyen, Git List

Junio C Hamano wrote:
> Martin Fick <mfick@codeaurora.org> writes:
>> Assuming I understand what you are suggesting, would these
>> "young object" likely still get "deduped" in an efficient
>> way without doing history traversal (it sounds like they
>> would)?
>
> Yes.
>
> The very first thing pack-object machinery does is to get the list
> of object names and sort them in a certain order to help producing
> good deltas, and this initial input preprocessing will dedup them.

So, the proposal is to create an index of young objects without doing
reachability analysis (I still didn't get the point of packing them;
as I pointed out, it seems to be rather counter-productive) to help
the actual packing? From what I vaguely understood:

1. Index all the young objects to save a history traversal (?)

2. Perform the reachability analysis using the index in step 1, and
then generate the pack.

I'm not yet clear about what information 1 contains to help 2. Is it
the rough ordering? (The big important objects come near the top of
the pack, and the deltas are generated against them). I say "rough"
because the ordering might change after the unreachable objects are
pruned.

*scratches head*


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 17:36             ` Ramkumar Ramachandra
@ 2013-08-08 19:37               ` Junio C Hamano
  2013-08-08 20:04                 ` Ramkumar Ramachandra
  0 siblings, 1 reply; 33+ messages in thread
From: Junio C Hamano @ 2013-08-08 19:37 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Duy Nguyen, Martin Fick, Git List

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> I asked you a very simple question: what happens when I do "git push"?

You asked "does push send unreachable cruft?" and I said No.  Does
that answer your question?

> 3. After a few days of working, the gc heuristics figure out that I
> have too much garbage and too many packs; a cleanup is required. The
> gc --auto which doesn't tolerate fragmentation: it tries to put
> everything into one large pack.

Today's "gc --auto" skims the history shallowly and packs loose
objects into a single _new_ pack.  If you start from a single pack
and enough loose objects and run "gc --auto", you will have two
packs, one intact from the original, one new.
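
What "gc --auto" runs in that case boils down to, roughly, a
loose-only repack, which is cheap to try by hand:

    git repack -d -l    # pack loose objects into one new pack; existing
                        # packs are left untouched
    ls .git/objects/pack/*.pack    # the original pack plus the new one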

> We're talking about tackling the gc aggression problem in 3.

> when I say "git push" (or when gc decides to repack), won't I need
> to explode the young pack into loose objects, do a reachability
> analysis, and repack anyway?

You do not explode anything.  "push" will read objects from
anywhere, either from loose or packed, and send a newly created pack
over the wire.

And I see from "(or when ...)" part that you do not know what you
are talking about.  "push" will not stream existing pack to the
other end without computation, and it never will.  It will reuse
existing individual pieces when it can, and having data in a pack
(even without deltifying, or as a suboptimal delta) is much better
for push performance than having the same data in a loose object if
only for this reason, as you do not have to recompress and reencode
it in a different way (loose objects and undeltified objects in a pack
are encoded differently).

> ... are you willing to explain it to me?

Hope that the above clarifies.

I've treated this thread as a design discussion, not an education
session, but the above ended up having more education material than
anything that would advance the design.


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 19:37               ` Junio C Hamano
@ 2013-08-08 20:04                 ` Ramkumar Ramachandra
  2013-08-08 21:09                   ` Martin Langhoff
  2013-08-09 11:00                   ` Jeff King
  0 siblings, 2 replies; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-08 20:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Martin Fick, Git List

Junio C Hamano wrote:
>> 3. After a few days of working, the gc heuristics figure out that I
>> have too much garbage and too many packs; a cleanup is required. The
>> gc --auto which doesn't tolerate fragmentation: it tries to put
>> everything into one large pack.
>
> Today's "gc --auto" skims the history shallowly and packs loose
> objects into a single _new_ pack.  If you start from a single pack
> and enough loose objects and run "gc --auto", you will have two
> packs, one intact from the original, one new.

That is when I have lots of loose objects and few packs. We are
discussing gc aggression when there are too many packs (how did we get
there in the first place if new packs aren't created?); in that case
it doesn't tolerate fragmentation, and tries to put everything in one
pack. That is both IO and CPU intensive, and we've been trying to
tackle that since the start of the thread.

> And I see from "(or when ...)" part that you do not know what you
> are talking about.  "push" will not stream existing pack to the
> other end without computation, and it never will.  It will reuse
> existing individual pieces when it can, and having data in a pack
> (even without deltifying, or as a suboptimal delta) is much better
> for push performance than having the same data in a loose object if
> only for this reason, as you do not have to recompress and reencode
> it in a different way (loose objects and undelitifed object in pack
> are encoded differently).

Certainly. A push will never use an existing pack as-is, as it's very
highly unlikely that the server requested exactly what gc --auto
packed for us locally.

Sure, undeltified objects in the pack are probably better for push
performance, but I'm talking about the majority: deltified objects.
Don't you need to apply the deltas (i.e. explode the pack) before you
can recompute the deltas for the information you're sending across for
the push?

> I've treated this thread as a design discussion, not an education
> session, but the above ended up having more education material than
> anything that would advance the design.

You will need to educate your contributors and potential contributors
if you want these problems to be fixed.


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 20:04                 ` Ramkumar Ramachandra
@ 2013-08-08 21:09                   ` Martin Langhoff
  2013-08-09 11:00                   ` Jeff King
  1 sibling, 0 replies; 33+ messages in thread
From: Martin Langhoff @ 2013-08-08 21:09 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Junio C Hamano, Duy Nguyen, Martin Fick, Git List

On Thu, Aug 8, 2013 at 4:04 PM, Ramkumar Ramachandra <artagnon@gmail.com> wrote:
> You will need to educate your contributors and potential contributors
> if you want these problems to be fixed.

TBH, I think Junio is exceedingly kind and generous with his time.

IME (and there's quite a few years of it :-) ), good contributors
learn a little bit by asking, and a ton on their own.

Your tone is quite demanding, which I personally find... rather out of place.



m
-- 
 martin.langhoff@gmail.com
 -  ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 ~ http://docs.moodle.org/en/User:Martin_Langhoff


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-08 20:04                 ` Ramkumar Ramachandra
  2013-08-08 21:09                   ` Martin Langhoff
@ 2013-08-09 11:00                   ` Jeff King
  2013-08-09 13:34                     ` Ramkumar Ramachandra
  1 sibling, 1 reply; 33+ messages in thread
From: Jeff King @ 2013-08-09 11:00 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Junio C Hamano, Duy Nguyen, Martin Fick, Git List

On Fri, Aug 09, 2013 at 01:34:48AM +0530, Ramkumar Ramachandra wrote:

> Certainly. A push will never use an existing pack as-is, as it's very
> highly unlikely that the server requested exactly what gc --auto
> packed for us locally.
> 
> Sure, undeltified objects in the pack are probably better for push
> performance, but I'm talking about the majority: deltified objects.
> Don't you need to apply the deltas (ie. explode the pack), before you
> can recompute the deltas for the information you're sending across for
> the push?

It depends on what each side has, doesn't it? We generally try to
reuse on-disk deltas when we can, since they require no computation. If
I have object A delta'd against B, and I know that the other side wants
A and has B (or I am also sending B), I can simply send what I have on
disk. So we do not just blit out the existing pack as-is, but we may
reuse portions of it as appropriate.

Of course we may have to reconstruct deltas for trees in order to find
the correct set of objects (i.e., the history traversal). But that is a
separate phase from generating the pack's object content, and we do not
reuse any of the traversal work in later phases.

-Peff


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-09 11:00                   ` Jeff King
@ 2013-08-09 13:34                     ` Ramkumar Ramachandra
  2013-08-09 17:35                       ` Junio C Hamano
  2013-08-09 22:16                       ` Jeff King
  0 siblings, 2 replies; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-09 13:34 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Duy Nguyen, Martin Fick, Git List

Jeff King wrote:
> It depends on what each side has, doesn't it? We generally try to
> reuse on-disk deltas when we can, since they require no computation. If
> I have object A delta'd against B, and I know that the other side wants
> A and has B (or I am also sending B), I can simply send what I have on
> disk. So we do not just blit out the existing pack as-is, but we may
> reuse portions of it as appropriate.

I'll raise some (hopefully interesting) points. Let's take the example
of a simple push: I start send-pack, which in turn starts receive_pack
on the server and connects its stdin/stdout to it (using git_connect).
Now, it reads the (sha1, ref) pairs it receives on stdin and spawns
pack-objects --stdout with the right arguments as the response, right?
Overall, nothing special: just pack-objects invoked with specific
arguments.

How does pack-objects work? ll_find_deltas() spawns some worker
threads to find_deltas(). This is where things get fuzzy for me: it
calls try_delta() in a nested loop, trying to find the smallest delta,
right? I'm not sure whether the interfaces it uses to read objects
differentiate between packed deltas versus packed undeltified objects
versus loose objects at all.

Now, the main problem I see is that a pack has to be self-contained: I
can't have an object "bar" which is a delta against an object that is
not present in the pack, right? Therefore no matter what the server
already has, I have to prepare deltas only against the data that I'm
sending across.

> Of course we may have to reconstruct deltas for trees in order to find
> the correct set of objects (i.e., the history traversal). But that is a
> separate phase from generating the pack's object content, and we do not
> reuse any of the traversal work in later phases.

I see. Are we talking about tree-walk.c here? This is not unique to
packing at all; we need to do this kind of traversal for any git
operation that digs into older history, no? I recall a discussion
about using generation numbers to speed up the walk: I tried playing
with your series (where you use a cache to keep the generation
numbers), but got nowhere. Does it make sense to think about speeding
up the walk, and if so, how?


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-09 13:34                     ` Ramkumar Ramachandra
@ 2013-08-09 17:35                       ` Junio C Hamano
  2013-08-09 22:16                       ` Jeff King
  1 sibling, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2013-08-09 17:35 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Jeff King, Duy Nguyen, Martin Fick, Git List

Ramkumar Ramachandra <artagnon@gmail.com> writes:

> I'll raise some (hopefully interesting) points. Let's take the example
> of a simple push: I start send-pack, which in turn starts receive_pack
> on the server and connects its stdin/stdout to it (using git_connect).
> Now, it reads the (sha1, ref) pairs it receives on stdin and spawns
> pack-objects --stdout with the right arguments as the response, right?
> Overall, nothing special: just pack-objects invoked with specific
> arguments.
>
> How does pack-objects work?  ...

You've been here long enough to know that you can and should learn
things yourself instead of repeating "tell me tell me" ;-)

> threads to find_deltas(). This is where things get fuzzy for me: it
> calls try_delta() in a nested loop, trying to find the smallest delta,
> right?

Yes.

> I'm not sure whether the interfaces it uses to read objects
> differentiate between packed deltas versus packed undeltified objects
> versus loose objects at all.

Yes, but that happens way before that.  Start reading from the
pack-heuristics document to get the general feel of what goes on,
and then go back to the source.

> Now, the main problem I see is that a pack has to be self-contained: I
> can't have an object "bar" which is a delta against an object that is
> not present in the pack, right? Therefore no matter what the server
> already has, I have to prepare deltas only against the data that I'm
> sending across.

There is a --thin option that allows deltas against base objects
known to exist on the other end to be used in the pack without
including those base objects.  The receiving end then re-adds the
base objects to the received data to complete the pack.
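
To make that concrete, a rough sketch of the round trip; $old and $new
are placeholders for a commit the other side already has and the new
tip being sent:

  # sender: build a thin pack whose deltas may use objects reachable
  # from $old as bases, without including those bases in the pack
  { echo "$new"; echo "^$old"; } |
          git pack-objects --revs --thin --stdout >thin.pack

  # receiver: append the missing base objects from the local object
  # store, turning the thin pack back into a self-contained one
  git index-pack --fix-thin --stdin <thin.pack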


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-09 13:34                     ` Ramkumar Ramachandra
  2013-08-09 17:35                       ` Junio C Hamano
@ 2013-08-09 22:16                       ` Jeff King
  2013-08-10  1:24                         ` Duy Nguyen
                                           ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Jeff King @ 2013-08-09 22:16 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Junio C Hamano, Duy Nguyen, Martin Fick, Git List

On Fri, Aug 09, 2013 at 07:04:23PM +0530, Ramkumar Ramachandra wrote:

> I'll raise some (hopefully interesting) points. Let's take the example
> of a simple push: I start send-pack, which in turn starts receive_pack
> on the server and connects its stdin/stdout to it (using git_connect).
> Now, it reads the (sha1, ref) pairs it receives on stdin and spawns
> pack-objects --stdout with the right arguments as the response, right?
> Overall, nothing special: just pack-objects invoked with specific
> arguments.
> 
> How does pack-objects work? ll_find_deltas() spawns some worker
> threads to find_deltas(). This is where things get fuzzy for me: it
> calls try_delta() in a nested loop, trying to find the smallest delta,
> right? I'm not sure whether the interfaces it uses to read objects
> differentiate between packed deltas versus packed undeltified objects
> versus loose objects at all.

You are missing a step in the preparation of the delta list. If we
already have a delta on-disk, we will check whether it is usable and
keep a note in the list (see check_object). Then when we
prepare the list of objects to delta, it is omitted from the list (see
prepare_pack).

That is why you may see a much smaller number of objects in the progress
eye candy for "Compressing objects..." than we are actually sending.
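
As an aside, you can see which objects already sit in a pack as
deltas, and are therefore candidates for this reuse, with verify-pack
(the pack name is whatever you happen to have under .git/objects/pack):

  # columns: sha1, type, size, size-in-pack, offset, and for deltified
  # objects additionally the delta depth and the base object's sha1
  git verify-pack -v .git/objects/pack/pack-*.idx | head -20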

> Now, the main problem I see is that a pack has to be self-contained: I
> can't have an object "bar" which is a delta against an object that is
> not present in the pack, right? Therefore no matter what the server
> already has, I have to prepare deltas only against the data that I'm
> sending across.

As Junio mentioned, that is what "--thin" is about; the sender omits the
base and the receiver adds it back in ("index-pack --fix-thin"). And if
you think about it, that is likely where most of Martin's "317 packs
turned into 8MB" space savings are coming from.

Somebody pushes X, which deltas against Y. They send only the delta for
X->Y, but the receiver stores that delta plus a full copy of Y, even
though we already have Y in another pack. Repacking throws away the
extra copy of Y.
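
To put a rough number on that, one quick-and-dirty way to count
objects that are stored in more than one pack (purely illustrative,
and not fast on big repositories):

  for idx in .git/objects/pack/pack-*.idx
  do
          # git show-index prints "offset sha1 crc" for each entry
          git show-index <"$idx" | awk '{print $2}'
  done | sort | uniq -d | wc -l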

And that is why something like Junio's suggestion of "do not traverse
history; just concatenate the packs and throw away duplicate objects"
might be helpful. You would not find new deltas, but you would get rid
of these duplicate objects, and doing so would be relatively cheap.

Another solution could involve not writing the duplicate of Y in the
first place. The reason we do not store thin-packs on disk is that you
run into problems with cycles in the delta graph (e.g., A deltas against
B, which deltas against C, which deltas against A; at one point you had
a full copy of one object which let you create the cycle, but you later
deleted it as redundant with the delta, and now you cannot reconstruct
any of the objects).

You could possibly solve this with cycle detection, though it would be
complicated (you need to do it not just when getting rid of objects, but
when sending a pack, to make sure you don't send a cycle of deltas that
the other end cannot use). You _might_ be able to get by with a kind of
"two-level" hack: consider your main pack as "group A" and newly pushed
packs as "group B". Allow storing thin deltas on disk from group B
against group A, but never the reverse (nor within group B). That makes
sure you don't have cycles, and it eliminates even more I/O than any
repacking solution (because you never write the extra copy of Y to disk
in the first place). But I can think of two problems:

  1. You still want to repack more often than every 300 packs, because
     having many packs costs both in space and in object lookup
     time (we can do a log(N) search through each pack index, but have
     to search linearly through the set of indices).

  2. As you accumulate group B packs with new objects, the deltas that
     people send will tend to be against objects in group B. They are
     closer to the tip of history, and therefore make better deltas for
     history built on top.

That's all just off the top of my head. There are probably other flaws,
too, as I haven't considered it too hard. But if you are really
concerned about I/O on a busy repo (and I think for hosting sites, it is
a real problem), the best-performing solution will probably not involve
packs at all, but some kind of delta-capable storage that you can
incrementally add to without rewriting the whole repository. The
downside is that it would be significantly more complicated.

> > Of course we may have to reconstruct deltas for trees in order to find
> > the correct set of objects (i.e., the history traversal). But that is a
> > separate phase from generating the pack's object content, and we do not
> > reuse any of the traversal work in later phases.
> 
> I see. Are we talking about tree-walk.c here? This is not unique to
> packing at all; we need to do this kind of traversal for any git
> operation that digs into older history, no? I recall a discussion
> about using generation numbers to speed up the walk: I tried playing
> with your series (where you use a cache to keep the generation
> numbers), but got nowhere. Does it make sense to think about speeding
> up the walk, and if so, how?

It's actually list-objects.c, and no, it is not unique to packing (you
can run the same code with "git rev-list --objects"). The CPU time for such
a traversal is generally spent in:

  1. Accessing commit objects data from disk (this shows up as libz in
     perf runs).

  2. Accessing tree object data from disk (libz again).

  3. Reconstructing tree deltas (this shows up as memcpy in perf, but of
     course we have to access base objects, which increases item 2).

  4. Converting sha1s into "struct object" via hash table lookup
     (lookup_object in perf).
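
If you want to see that breakdown on your own repository, something
along these lines should work (assuming perf is available; the symbols
named above are the ones to look for):

  # profile a full object traversal, then look at the hottest symbols
  perf record -g git rev-list --objects --all >/dev/null
  perf report --sort symbol | head -20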

Generation numbers can make some commit walks faster by cutting off the
traversal early. But they don't really apply here, for a few reasons.
One, our worst case is actually a full clone or prune, and there is no
cutoff; we must go down to the roots. Two, we use commit timestamps as a
proxy for generation numbers. Most of the traversal algorithms use the
timestamps in a priority queue, and are written such that we always get
the right answer in the face of skewed timestamps, but we may go deeper
than necessary if they are skewed. The exception is the depth-first
"tag --contains" algorithm, which typically goes to the roots (and if
you limit on timestamp, will give the wrong answer in the face of skew).
And that is the algorithm that is sped up by having generation numbers.

So I do not think they help with (1). You can speed up (1) by storing
commit information in a more readily accessible format (e.g., see the
"commit metapack" patches I posted in March or April of this year). But
you cannot win all that much here. Try "git rev-list --all" on the
kernel versus "git rev-list --objects --all". The former touches only
the commits, and takes about 5 seconds. The latter touches all objects
and takes about 30 seconds. So even if you could make the commit access
and traversal instantaneous, you could save only 5 out of 30 seconds.
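
That comparison is easy to reproduce in a clone of the kernel; the
absolute numbers will of course vary with hardware and with how well
the repository is packed:

  time git rev-list --all >/dev/null            # commits only
  time git rev-list --objects --all >/dev/null  # commits, trees and blobs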

For (2), I experimented with leaving tree objects uncompressed in the
pack. It does help (and you do not lose much space, because tree objects
compress horribly, as they are mostly random sha1 bytes). But not that
much.

For (3), we keep a cache of delta bases to avoid re-accessing the same
tree bases over and over. Usually this works fine, but I ran into a case
recently where "rev-list --objects" was halved by bumping
core.deltabasecachelimit from its default of 16m to 128m (this
particular repository had some large trees, and a very odd pattern of
merges that I think made the locality of tree access very poor).
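
That knob is an ordinary config variable; 128m is just the value that
happened to help in that particular case, not a general
recommendation:

  git config core.deltaBaseCacheLimit 128m
  time git rev-list --objects --all >/dev/null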

Which leaves (4), the hash lookup. We have to do a _lot_ of these,
because we have to look up each entry of each tree. So we end up looking
up the same sha1s over and over again (e.g., we see that "Makefile" is at
sha1 "1234abcd", and we output it. Then we go to the tree of the parent
commit, and we look at its "Makefile" entry. Guess where it almost
certainly still points, because any given commit only touches a handful
of paths?).

There have been several attempts to make this faster. Micro-optimizing
hashcmp, which helps a little. I looked at quadratic probing in the hash
table, and Junio looked at cuckoo hashing.  Neither provided good
results. I looked at exploiting locality of the repeated lookups, which
resulted in 9a41448.

But ultimately, something like reachability bitmaps is going to perform
a lot better, because you can avoid doing the traversal at all. Another
solution which we've discussed but AFAIK nobody has ever implemented is
having a cache to show which entries changed between two trees (e.g.,
the tree of a commit and the tree of one of its children). That would
let us skip the repeated hash lookups (you would know whether you
already processed the child's tree and its entries, and if so, you know
that you only need to process the differences to get complete coverage).


That rambled off topic a bit and got long. But you seem to have a lot of
questions in this area, so the simplest thing seemed to be to just dump
the current state of affairs. I hope it is useful.

-Peff


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-09 22:16                       ` Jeff King
@ 2013-08-10  1:24                         ` Duy Nguyen
  2013-08-10  9:50                           ` Jeff King
  2013-08-10  5:26                         ` Junio C Hamano
  2013-08-10  8:42                         ` Ramkumar Ramachandra
  2 siblings, 1 reply; 33+ messages in thread
From: Duy Nguyen @ 2013-08-10  1:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Ramkumar Ramachandra, Junio C Hamano, Martin Fick, Git List

On Sat, Aug 10, 2013 at 5:16 AM, Jeff King <peff@peff.net> wrote:
> Another solution could involve not writing the duplicate of Y in the
> first place. The reason we do not store thin-packs on disk is that you
> run into problems with cycles in the delta graph (e.g., A deltas against
> B, which deltas against C, which deltas against A; at one point you had
> a full copy of one object which let you create the cycle, but you later
> deleted it as redundant with the delta, and now you cannot reconstruct
> any of the objects).
>
> You could possibly solve this with cycle detection, though it would be
> complicated (you need to do it not just when getting rid of objects, but
> when sending a pack, to make sure you don't send a cycle of deltas that
> the other end cannot use). You _might_ be able to get by with a kind of
> "two-level" hack: consider your main pack as "group A" and newly pushed
> packs as "group B". Allow storing thin deltas on disk from group B
> against group A, but never the reverse (nor within group B). That makes
> sure you don't have cycles, and it eliminates even more I/O than any
> repacking solution (because you never write the extra copy of Y to disk
> in the first place). But I can think of two problems:
>
>   1. You still want to repack more often than every 300 packs, because
>      having many packs costs both in space and in object lookup
>      time (we can do a log(N) search through each pack index, but have
>      to search linearly through the set of indices).
>
>   2. As you accumulate group B packs with new objects, the deltas that
>      people send will tend to be against objects in group B. They are
>      closer to the tip of history, and therefore make better deltas for
>      history built on top.
>
> That's all just off the top of my head. There are probably other flaws,
> too, as I haven't considered it too hard.

Some refinements on this idea

 - We could keep packs in group B ordered as the packs come in. The
new pack can depend on the previous ones.
 - A group index, in addition to the separate index for each pack,
would solve the linear-search object lookup problem.
-- 
Duy


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-09 22:16                       ` Jeff King
  2013-08-10  1:24                         ` Duy Nguyen
@ 2013-08-10  5:26                         ` Junio C Hamano
  2013-08-10  8:42                         ` Ramkumar Ramachandra
  2 siblings, 0 replies; 33+ messages in thread
From: Junio C Hamano @ 2013-08-10  5:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Ramkumar Ramachandra, Duy Nguyen, Martin Fick, Git List

Jeff King <peff@peff.net> writes:

> ... The reason we do not store thin-packs on disk is that you
> run into problems with cycles in the delta graph (e.g., A deltas against
> B, which deltas against C, which deltas against A; at one point you had
> a full copy of one object which let you create the cycle, but you later
> deleted it as redundant with the delta, and now you cannot reconstruct
> any of the objects).

An extension of that is that a thin pack would make the "validity"
of a single pack dependent on its surroundings, making "verify-pack"
useless.  It may say "everything is fine", but corruption of another
pack that holds a base object that a delta in it depends on will
render the pack unusable.

With the current arrangement, if you grab a single pack and re-idx
it, you know you can reconstruct all the objects in it.
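
A small illustration of that property (the pack name is only a
placeholder for whichever pack you want to check):

  # copy one .pack somewhere, rebuild its index from the .pack alone,
  # and check that every object in it can be reconstructed in isolation
  cp .git/objects/pack/pack-1234abcd.pack /tmp/
  git index-pack /tmp/pack-1234abcd.pack
  git verify-pack /tmp/pack-1234abcd.idx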

The original reason was not cycles; it was primarily because we did
not want to have to map more than one packfile while reconstructing
the delta chain for a single object (this was way before the "open
small mmap windows into a large packfile" approach was invented).


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-09 22:16                       ` Jeff King
  2013-08-10  1:24                         ` Duy Nguyen
  2013-08-10  5:26                         ` Junio C Hamano
@ 2013-08-10  8:42                         ` Ramkumar Ramachandra
  2013-08-10  9:24                           ` Duy Nguyen
  2013-08-10  9:38                           ` Jeff King
  2 siblings, 2 replies; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-10  8:42 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Duy Nguyen, Martin Fick, Git List

Jeff King wrote:
> [...]

First off, thanks for the fabulous writeup. I hope more contributors
read this, and get interested in tackling the problems.

> You are missing a step in the preparation of the delta list. If we
> already have a delta on-disk, we will check whether it is usable and
> keep a note in the list (see check_object). Then when we
> prepare the list of objects to delta, it is omitted from the list (see
> prepare_pack).

Ah, prepare_pack() calls get_object_details(), which then calls
check_object(): reuse_delta is what I was looking for.

> That is why you may see a much smaller number of objects in the progress
> eye candy for "Compressing objects..." than we are actually sending.

I see.

> As Junio mentioned, that is what "--thin" is about; the sender omits the
> base and the receiver adds it back in ("index-pack --fix-thin").

Yeah, I read about the --thin option. However, it's only for
network-packs (i.e --stdout; why?). Also, is it turned on by default?
The documentation says so, but I ran it and found that the value of
thin is 0 in builtin/push.c:316. What is going on?

> And if
> you think about it, that is likely where most of Martin's "317 packs
> turned into 8MB" space savings are coming from.

Thin packs have nothing to do with git-exproll per se, but I think I
understand why pack generations work.

> The reason we do not store thin-packs on disk is that you
> run into problems with cycles in the delta graph (e.g., A deltas against
> B, which deltas against C, which deltas against A; at one point you had
> a full copy of one object which let you create the cycle, but you later
> deleted it as redundant with the delta, and now you cannot reconstruct
> any of the objects).

Ah, that's why.

> Somebody pushes X, which deltas against Y. They send only the delta for
> X->Y, but the receiver stores that delta plus a full copy of Y, even
> though we already have Y in another pack. Repacking throws away the
> extra copy of Y.

Right, I got that.

> And that is why something like Junio's suggestion of "do not traverse
> history; just concatenate the packs and throw away duplicate objects"
> might be helpful. You would not find new deltas, but you would get rid
> of these duplicate objects, and doing so would be relatively cheap.

... but we've already determined that git pack-objects does a good job
of reusing deltas, so I'm not sure what we can do over and above this.

If I understand correctly, the proposal is to use existing (but
sub-optimal) deltas more aggressively when repacking?

> Another solution could involve not writing the duplicate of Y in the
> first place.

> [...]

> You could possibly solve this with cycle detection, though it would be
> complicated (you need to do it not just when getting rid of objects, but
> when sending a pack, to make sure you don't send a cycle of deltas that
> the other end cannot use). You _might_ be able to get by with a kind of
> "two-level" hack: consider your main pack as "group A" and newly pushed
> packs as "group B". Allow storing thin deltas on disk from group B
> against group A, but never the reverse (nor within group B). That makes
> sure you don't have cycles, and it eliminates even more I/O than any
> repacking solution (because you never write the extra copy of Y to disk
> in the first place). But I can think of two problems:

Hm, having two kinds of packs: full packs and thin packs, and
different codepaths to handle them.

>   1. You still want to repack more often than every 300 packs, because
>      having many packs costs both in space and in object lookup
>      time (we can do a log(N) search through each pack index, but have
>      to search linearly through the set of indices).

Once you narrow it down to a pack-index, you can bisect it since the
table is sorted. How do you find the right pack-index though? Do I
have to do the bisection-search on each one to see if the object
exists? Can this be improved by generating a global index for all
packs that contains packfile-name in addition to offset for each
object?

>   2. As you accumulate group B packs with new objects, the deltas that
>      people send will tend to be against objects in group B. They are
>      closer to the tip of history, and therefore make better deltas for
>      history built on top.

Actually, I'm not sure it's worth the additional complexity at all.
Keeping packs self-contained means simpler code and easier
isolation in the face of corruption: the space isn't a big trade-off,
unless there are lots of really tiny packs (in which case we
consolidate them into larger packs).

The way I see it, we'll be saving on some IO by creating thin packs.
But the trade-off seems to be quite heavy: self-contained packs means
that I can just mmap() that pack-index, look for objects and apply
deltas without bothering about any other packs. Thin packs means that
I'll have to mmap() a lot more than one pack-index, and spend a lot
more time searching for the base object to apply the delta to, right?

> That's all just off the top of my head. There are probably other flaws,
> too, as I haven't considered it too hard. But if you are really
> concerned about I/O on a busy repo (and I think for hosting sites, it is
> a real problem), the best-performing solution will probably not involve
> packs at all, but some kind of delta-capable storage that you can
> incrementally add to without rewriting the whole repository. The
> downside is that it would be significantly more complicated.

I think packs are an excellent solution. The reason is quite simple:
if you use incremental delta storage, those deltas will be useless
(aka. completely suboptimal) after a few days of additional data.
Generating effective deltas is an expensive operation that involves
lining up objects using heuristics (like size), and sliding windows
over them to get the most optimal delta. If you generate lots of
suboptimal deltas, pack-objects can't reuse them and your server will
not be able to do network operations effectively.

> It's actually list-objects.c, and no, it is not unique to packing (you
> can run the same code with "git rev-list --objects"). The CPU time for such
> a traversal is generally spent in:
>
>   1. Accessing commit objects data from disk (this shows up as libz in
>      perf runs).
>
>   2. Accessing tree object data from disk (libz again).
>
>   3. Reconstructing tree deltas (this shows up as memcpy in perf, but of
>      course we have to access base objects, which increases item 2).
>
>   4. Converting sha1s into "struct object" via hash table lookup
>      (lookup_object in perf).

Makes sense.

> Generation numbers can make some commit walks faster by cutting off the
> traversal early. But they don't really apply here, for a few reasons.
> One, our worst case is actually a full clone or prune, and there is no
> cutoff; we must go down to the roots. Two, we use commit timestamps as a
> proxy for generation numbers. Most of the traversal algorithms use the
> timestamps in a priority queue, and are written such that we always get
> the right answer in the face of skewed timestamps, but we may go deeper
> than necessary if they are skewed.

Yeah, I noticed. The main advantage of using generation numbers seemed
to be the elimination of SLOP from revision.c:limit_list(); slightly
cleaner code, but not much else. I was wondering if there was
something more to it.

> The exception is the depth-first
> "tag --contains" algorithm, which typically goes to the roots (and if
> you limit on timestamp, will give the wrong answer in the face of skew).
> And that is the algorithm that is sped up by having generation numbers.

Ah, I see.

> So I do not think they help with (1). You can speed up (1) by storing
> commit information in a more readily accessible format (e.g., see the
> "commit metapack" patches I posted in March or April of this year). But
> you cannot win all that much here. Try "git rev-list --all" on the
> kernel versus "git rev-list --objects --all". The former touches only
> the commits, and takes about 5 seconds. The latter touches all objects
> and takes about 30 seconds. So even if you could make the commit access
> and traversal instantaneous, you could save only 5 out of 30 seconds.

Oh, ouch. I haven't gotten around to reading the "commit metapack" series yet.

> For (2), I experimented with leaving tree objects uncompressed in the
> pack. It does help (and you do not lose much space, because tree objects
> compress horribly, as they are mostly random sha1 bytes). But not that
> much.

Interesting point: tree objects cannot be compressed. If I do one
file/directory rename, the old 20-byte hash and the new 20-byte hash
have no relationship to each other, so we're wasting time trying to
find patterns (for compression) in trees. But yeah, eliminating tree
compression might not help much, because there are so few trees: on a
typical repository, there are very few renames.

> For (3), we keep a cache of delta bases to avoid re-accessing the same
> tree bases over and over. Usually this works fine, but I ran into a case
> recently where "rev-list --objects" was halved by bumping
> core.deltabasecachelimit from its default of 16m to 128m (this
> particular repository had some large trees, and a very odd pattern of
> merges that I think made the locality of tree access very poor).

Yeah, I saw the LRU delta_base_cache in sha1_file.c. It's probably
already as optimal as it can be.

> Which leaves (4), the hash lookup. We have to do a _lot_ of these,
> because we have to look up each entry of each tree. So we end up looking
> up the same sha1s over and over again (e.g., we see that "Makefile" is at
> sha1 "1234abcd", and we output it. Then we go to the tree of the parent
> commit, and we look at its "Makefile" entry. Guess where it almost
> certainly still points, because any given commit only touches a handful
> of paths?).
>
> There have been several attempts to make this faster. Micro-optimizing
> hashcmp, which helps a little. I looked at quadratic probing in the hash
> table, and Junio looked at cuckoo hashing.  Neither provided good
> results. I looked at exploiting locality of the repeated lookups, which
> resulted in 9a41448.

To lower the cost of linear lookup during a collision, you move the
object you hit up the chain for subsequent lookups. Very interesting.

> But ultimately, something like reachability bitmaps is going to perform
> a lot better, because you can avoid doing the traversal at all. Another
> solution which we've discussed but AFAIK nobody has ever implemented is
> having a cache to show which entries changed between two trees (e.g.,
> the tree of a commit and the tree of one of its children). That would
> let us skip the repeated hash lookups (you would know whether you
> already processed the child's tree and its entries, and if so, you know
> that you only need to process the differences to get complete coverage).

It'll take me some time to digest this information and come up with
questions/ patches.

Thanks for the food for thought.


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10  8:42                         ` Ramkumar Ramachandra
@ 2013-08-10  9:24                           ` Duy Nguyen
  2013-08-10  9:28                             ` Duy Nguyen
  2013-08-10  9:43                             ` Jeff King
  2013-08-10  9:38                           ` Jeff King
  1 sibling, 2 replies; 33+ messages in thread
From: Duy Nguyen @ 2013-08-10  9:24 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Jeff King, Junio C Hamano, Martin Fick, Git List

On Sat, Aug 10, 2013 at 3:42 PM, Ramkumar Ramachandra
<artagnon@gmail.com> wrote:
>> As Junio mentioned, that is what "--thin" is about; the sender omits the
>> base and the receiver adds it back in ("index-pack --fix-thin").
>
> Yeah, I read about the --thin option. However, it's only for
> network-packs (i.e --stdout; why?). Also, is it turned on by default?
> The documentation says so, but I ran it and found that the value of
> thin is 0 in builtin/push.c:316. What is going on?

--thin is enabled by default for fetch (see
transport.c:transport_get()) but it's only effective when the server
advertises "thin-pack" capability (see protocol-capabilities.txt).
push has --thin turned off by default favoring server resources over
network traffic, see a4503a1 (Make --no-thin the default in git-push
to save server resources - 2007-09-09)
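
If you are curious what a particular server advertises, one way to
peek (assuming a git new enough to have packet tracing) is:

  # the first advertised ref line carries the capability list;
  # look for "thin-pack" among the capabilities
  GIT_TRACE_PACKET=1 git ls-remote origin 2>&1 | grep -m1 thin-pack
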
-- 
Duy


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10  9:24                           ` Duy Nguyen
@ 2013-08-10  9:28                             ` Duy Nguyen
  2013-08-10  9:43                             ` Jeff King
  1 sibling, 0 replies; 33+ messages in thread
From: Duy Nguyen @ 2013-08-10  9:28 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Jeff King, Junio C Hamano, Martin Fick, Git List

On Sat, Aug 10, 2013 at 4:24 PM, Duy Nguyen <pclouds@gmail.com> wrote:
> push has --thin turned off by default favoring server resources over
> network traffic, see a4503a1 (Make --no-thin the default in git-push
> to save server resources - 2007-09-09)

Side note, I think the server should be able to control the default
behavior. Maybe if the server advertises "thin-pack" capability and
the user does not explicitly specify --no-thin, then --thin is
automatically turned on at the client.
-- 
Duy


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10  8:42                         ` Ramkumar Ramachandra
  2013-08-10  9:24                           ` Duy Nguyen
@ 2013-08-10  9:38                           ` Jeff King
  1 sibling, 0 replies; 33+ messages in thread
From: Jeff King @ 2013-08-10  9:38 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Junio C Hamano, Duy Nguyen, Martin Fick, Git List

On Sat, Aug 10, 2013 at 02:12:46PM +0530, Ramkumar Ramachandra wrote:

> First off, thanks for the fabulous writeup. I hope more contributors
> read this, and get interested in tackling the problems.

You're welcome. I'll just respond to a few questions/points you raised
in your response:

> Yeah, I read about the --thin option. However, it's only for
> network-packs (i.e --stdout; why?). Also, is it turned on by default?
> The documentation says so, but I ran it and found that the value of
> thin is 0 in builtin/push.c:316. What is going on?

I'm not sure whether "push --thin" actually does anything anymore. It
looks like we always turn on thin-packs for the git transport in
transport_get, and have ever since 9b28851 (Push code for transport
library, 2007-09-10). And you cannot turn off thin-packing with
"--no-thin".

So I suspect it is historical cruft at this point.

> > And if
> > you think about it, that is likely where most of Martin's "317 packs
> > turned into 8MB" space savings are coming from.
> 
> Thin packs have nothing to do with git-exproll per se, but I think I
> understand why pack generations work.

I think it matters in terms of understanding the workload, and why the
repo benefits from repacking. In a normal git repo that accumulates
packs via incremental "git repack" (e.g., by running "gc --auto" until
we hit the pack limit), you will not have duplicate objects (only
objects which could be delta'd across the pack boundary but are not).
But if you accumulate packs by repeated thin-pack pushing, you will have
good deltas, but extra copies of base objects.

> > And that is why something like Junio's suggestion of "do not traverse
> > history; just concatenate the packs and throw away duplicate objects"
> > might be helpful. You would not find new deltas, but you would get rid
> > of these duplicate objects, and doing so would be relatively cheap.
> 
> ... but we've already determined that git pack-objects does a good job
> of reusing deltas, so I'm not sure what we can do over and above this.

If I understand the proposal correctly, the problem space is something
like:

  A fully packed repository is the best way to store objects, both in
  terms of space efficiency and for delta reuse when clients fetch from
  us. But it is too expensive to do a full repack after each push. What
  makes it expensive, and how can we get around that?

I think the answer to "why is it expensive" is three-fold:

  1. We spend a lot of CPU on history traversal. Even if we have only 10
     new commits to pack, we have to traverse the _whole_ history to
     generate the pack.

  2. We spend some CPU on delta compression. This scales nicely (10
     commits means we only do deltas for those 10 commits).

  3. We do a lot of I/O, as we rewrite the whole of the old pack content
     into the new pack, along with the extra objects (so again, we might
     rewrite millions of objects just to add 10 more to the pack).

I do not think (2) is that big a deal. It scales nicely already, and the
deltas you get are worth it (and you would have to compute them on the
fly for the next clone anyway). But (1) and (3) are bad, because they
are based on the size of the repository, not the size of the new
content.

If we were to simply append the objects from the new packs onto
the big, old pack, dropping any duplicates but retaining the deltas we
have, we would make (1) and (3) go away. Our pack probably wouldn't be
as good as a "real" repack, as it would potentially contain unreachable
objects, and we might miss some delta opportunities. But we could
cheaply run the "concatenate" repack after each push, and then do a real
repack less frequently.
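
To sketch the shape of that in shell (not the real proposal: the
pack-objects call below still rewrites everything once, and installing
the result and dropping the old packs is left out, but the history
traversal is skipped entirely):

  # collect every object named in any pack index (no history walk),
  # then write them all into a single new pack; --window=0 skips the
  # search for new deltas, while usable on-disk deltas are still
  # reused and duplicate objects are written only once
  for idx in .git/objects/pack/pack-*.idx
  do
          git show-index <"$idx" | awk '{print $2}'
  done | sort -u | git pack-objects --window=0 /tmp/concatenated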

I am not sure if that is exactly what Junio is proposing. There are some
logistical complications (fixing up the pack header and footer, along
with the filename, and updating the pack index).

> >   1. You still want to repack more often than every 300 packs, because
> >      having many packs costs both in space and in object lookup
> >      time (we can do a log(N) search through each pack index, but have
> >      to search linearly through the set of indices).
> 
> Once you narrow it down to a pack-index, you can bisect it since the
> table is sorted. How do you find the right pack-index though? Do I
> have to do the bisection-search on each one to see if the object
> exists? Can this be improved by generating a global index for all
> packs that contains packfile-name in addition to offset for each
> object?

Right. The bisection-search is the "log(N)" I mention above; you can do
that on each pack index. But you have to search the packs linearly,
since there is no indication of which pack the object might be found in.

A global index would solve that. Right now, the way to create such an
index is "git repack -ad". :)

But there is nothing to say that we could not have a meta-index on top
of the existing pack indices (however it does introduce complications
with lifetimes, as right now we can delete redundant packs and their
indices).

> >   2. As you accumulate group B packs with new objects, the deltas that
> >      people send will tend to be against objects in group B. They are
> >      closer to the tip of history, and therefore make better deltas for
> >      history built on top.
> 
> Actually, I'm not sure it's worth the additional complexity at all.

Oh, I'm not either. I was mostly thinking out loud. :)

> Keeping packs self-contained means simpler code and easier
> isolation in the face of corruption: the space isn't a big trade-off,
> unless there are lots of really tiny packs (in which case we
> consolidate them into larger packs).

Exactly. The system right now is simple and robust, even if it has some
inefficiencies.

> The way I see it, we'll be saving on some IO by creating thin packs.
> But the trade-off seems to be quite heavy: self-contained packs means
> that I can just mmap() that pack-index, look for objects and apply
> deltas without bothering about any other packs. Thin packs means that
> I'll have to mmap() a lot more than one pack-index, and spend a lot
> more time searching for the base object to apply the delta to, right?

Yes. Right now, when we search for a base, we know it is in the same
pack. So if you have N packs, we might do N binary searches to look
up a single object. But with cross-pack deltas, we might do N
searches per delta link; an object with a 50-deep delta chain might
do 50N such lookups.

So any solution along these lines would probably want to have a global
index for the thin packs, or some other way of mitigating this.

> I think packs are an excellent solution. The reason is quite simple:
> if you use incremental delta storage, those deltas will be useless
> (aka. completely suboptimal) after a few days of additional data.
> Generating effective deltas is an expensive operation that involves
> lining up objects using heuristics (like size), and sliding windows
> over them to get the most optimal delta. If you generate lots of
> suboptimal deltas, pack-objects can't reuse them and your server will
> not be able to do network operations effectively.

I don't know that the deltas are useless or suboptimal after a few days.
Yes, we might have better deltas if we did a full repack, but I think it
would very much depend on the workload and the quality of deltas the
clients are pushing to you.

However, there is one thing we didn't talk about yet: the direction of
deltas. When we repack, we try to store the most recent items as bases,
and then delta going further back in history. This makes accessing
recent objects (which most workloads should do more of) cheaper, and I
imagine it helps with the delta base cache, too (since we traverse
backwards in time).

But deltas you get from other packs may not be in that order (in fact,
if they are from thin packs, they are almost certainly in the opposite
order, as the server and client have some shared objects that are older,
and then the client pushes a delta for the new object against the old).

> Interesting point: tree objects cannot be compressed. If I do one
> file/directory rename, the old 20-byte hash and the new 20-byte hash
> have no relationship to each other, so we're wasting time trying to
> find patterns (for compression) in trees. But yeah, eliminating tree
> compression might not help much, because there are so few trees: on a
> typical repository, there are very few renames.

There are a lot of trees in a typical repository. The breakdown of
objects in the kernel is:

  blob   992707 (30%)
  commit 417959 (13%)
  tree  1898643 (57%)

The reason there are more trees than blobs is that the sha1 changes
cascade through the hierarchy. If you put a new blob in foo/bar/baz,
we need to update the foo/bar, foo, and root trees.
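
Not necessarily how those particular numbers were produced, but one
way to get the same kind of breakdown for any repository:

  # count reachable objects by type; this does the full --objects
  # traversal discussed above, so it is not fast on large repositories
  git rev-list --objects --all | awk '{print $1}' |
          git cat-file --batch-check | awk '{print $2}' |
          sort | uniq -c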

I think dropping tree compression _does_ help; it's just that zlib
inflation is already reasonably fast, and our time is dominated by the
hash table performance. I posted some numbers a while back here:

  http://article.gmane.org/gmane.comp.version-control.git/212278

It uses about 3% more disk space, but "--objects" traversals are about
10-20% faster.

I didn't measure the time to _create_ packs, though. We would be
avoiding zlib deflate there, which is much more expensive than inflate.
Still, I don't think that zlib is all that big a deal when creating
packs; most of the time goes to the traversal and to looking for deltas.

-Peff


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10  9:24                           ` Duy Nguyen
  2013-08-10  9:28                             ` Duy Nguyen
@ 2013-08-10  9:43                             ` Jeff King
  2013-08-10  9:50                               ` Duy Nguyen
  1 sibling, 1 reply; 33+ messages in thread
From: Jeff King @ 2013-08-10  9:43 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ramkumar Ramachandra, Junio C Hamano, Martin Fick, Git List

On Sat, Aug 10, 2013 at 04:24:48PM +0700, Nguyen Thai Ngoc Duy wrote:

> > Yeah, I read about the --thin option. However, it's only for
> > network-packs (i.e --stdout; why?). Also, is it turned on by default?
> > The documentation says so, but I ran it and found that the value of
> > thin is 0 in builtin/push.c:316. What is going on?
> 
> --thin is enabled by default for fetch (see
> transport.c:transport_get()) but it's only effective when the server
> advertises "thin-pack" capability (see protocol-capabilities.txt).
> push has --thin turned off by default favoring server resources over
> network traffic, see a4503a1 (Make --no-thin the default in git-push
> to save server resources - 2007-09-09)

Hmm. I don't think that is the case anymore.

If I do:

  git init parent &&
  (cd parent && seq 1 10000 >file &&
   git add file && git commit -m base
  ) &&
  git clone parent child &&
  cd child && seq 1 10001 >file &&
  git commit -a -m more &&
  GIT_TRACE=1 git push origin HEAD:foo

I see:

  trace: run_command: 'pack-objects' '--all-progress-implied' '--revs'
    '--stdout' '--thin' '--delta-base-offset' '--progress'

-Peff


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10  1:24                         ` Duy Nguyen
@ 2013-08-10  9:50                           ` Jeff King
  0 siblings, 0 replies; 33+ messages in thread
From: Jeff King @ 2013-08-10  9:50 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Ramkumar Ramachandra, Junio C Hamano, Martin Fick, Git List

On Sat, Aug 10, 2013 at 08:24:39AM +0700, Nguyen Thai Ngoc Duy wrote:

> > the other end cannot use). You _might_ be able to get by with a kind of
> > "two-level" hack: consider your main pack as "group A" and newly pushed
> > packs as "group B". Allow storing thin deltas on disk from group B
> > against group A, but never the reverse (nor within group B). That makes
> > sure you don't have cycles, and it eliminates even more I/O than any
> > repacking solution (because you never write the extra copy of Y to disk
> > in the first place). But I can think of two problems:
> [...]
> 
> Some refinements on this idea
> 
>  - We could keep packs in group B ordered as the packs come in. The
> new pack can depend on the previous ones.

I think you could dispense with the two-level scheme altogether and
simply give a definite ordering to packs, whereby newer packs can only
depend on older packs. Enforcing that with filesystem mtime feels a bit
error-prone; I think you'd want to explicitly store a counter somewhere.

>  - A group index in addition to separate index for each pack would
> solve linear search object lookup problem.

Yeah. I do not even think it would be that much work. It is a pure
optimization, so you can ignore issues like "what happens if I search
for an object, but the pack it is supposed to be in went away?". The
answer is "you fall back to a linear search through the packs", and
assume it happens infrequently enough not to care.

I'd wait to see how other proposed optimizations work out before doing a
global index, though.  The current wisdom is "don't have a ton of packs,
for both the index issue and other reasons, like wasting space and
on-the-fly deltas for fetches". If the "other reasons" go away, then a
global index would make sense to solve the remaining issue. But if the
solution for the other issues is to make it cheaper to repack so you can
do it more often, then the index issue just goes away.

-Peff


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10  9:43                             ` Jeff King
@ 2013-08-10  9:50                               ` Duy Nguyen
  2013-08-10 10:05                                 ` Ramkumar Ramachandra
  0 siblings, 1 reply; 33+ messages in thread
From: Duy Nguyen @ 2013-08-10  9:50 UTC (permalink / raw)
  To: Jeff King; +Cc: Ramkumar Ramachandra, Junio C Hamano, Martin Fick, Git List

On Sat, Aug 10, 2013 at 4:43 PM, Jeff King <peff@peff.net> wrote:
>> push has --thin turned off by default favoring server resources over
>> network traffic, see a4503a1 (Make --no-thin the default in git-push
>> to save server resources - 2007-09-09)
>
> Hmm. I don't think that is the case anymore.
>
> If I do:
>
>   git init parent &&
>   (cd parent && seq 1 10000 >file &&
>    git add file && git commit -m base
>   ) &&
>   git clone parent child &&
>   cd child && seq 1 10001 >file &&
>   git commit -a -m more &&
>   GIT_TRACE=1 git push origin HEAD:foo
>
> I see:
>
>   trace: run_command: 'pack-objects' '--all-progress-implied' '--revs'
>     '--stdout' '--thin' '--delta-base-offset' '--progress'
>

Right. transport_get() is also run for push and it sets
smart_options->thin = 1 unconditionally. Thanks for correcting me.
-- 
Duy


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10  9:50                               ` Duy Nguyen
@ 2013-08-10 10:05                                 ` Ramkumar Ramachandra
  2013-08-10 10:16                                   ` Duy Nguyen
  0 siblings, 1 reply; 33+ messages in thread
From: Ramkumar Ramachandra @ 2013-08-10 10:05 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Jeff King, Junio C Hamano, Martin Fick, Git List

Duy Nguyen wrote:
> Right. transport_get() is also run for push and it sets
> smart_options->thin = 1 unconditionally.

So, it looks like smart http implies the thin capability. I think
sop's patch (6 years ago) was about turning off thin on dumb http: can
we check that before deciding that push --[no-]thin is historical
cruft?

Also, a documentation update is required.


* Re: [PATCH] git exproll: steps to tackle gc aggression
  2013-08-10 10:05                                 ` Ramkumar Ramachandra
@ 2013-08-10 10:16                                   ` Duy Nguyen
  0 siblings, 0 replies; 33+ messages in thread
From: Duy Nguyen @ 2013-08-10 10:16 UTC (permalink / raw)
  To: Ramkumar Ramachandra; +Cc: Jeff King, Junio C Hamano, Martin Fick, Git List

On Sat, Aug 10, 2013 at 5:05 PM, Ramkumar Ramachandra
<artagnon@gmail.com> wrote:
> Duy Nguyen wrote:
>> Right. transport_get() is also run for push and it sets
>> smart_options->thin = 1 unconditionally.
>
> So, it looks like smart http implies the thin capability.

"smart_options" is a bit misleading. This applies to both smart
http://, git:// and ssh://

> I think
> sop's patch (6 years ago) was about turning off thin on dumb http

No, dumb http walks through the remote repository's object database;
there's no temporary pack created for transferring data. Dumb http has
nothing to do with this flag.
-- 
Duy



Thread overview: 33+ messages
2013-08-06  2:38 [PATCH] git exproll: steps to tackle gc aggression Ramkumar Ramachandra
2013-08-06 12:24 ` Duy Nguyen
2013-08-06 17:39   ` Junio C Hamano
2013-08-07  4:43     ` Ramkumar Ramachandra
2013-08-08  7:13       ` Junio C Hamano
2013-08-08  7:44         ` Ramkumar Ramachandra
2013-08-08 16:56           ` Junio C Hamano
2013-08-08 17:34             ` Martin Fick
2013-08-08 18:52               ` Junio C Hamano
2013-08-08 19:14                 ` Ramkumar Ramachandra
2013-08-08 17:36             ` Ramkumar Ramachandra
2013-08-08 19:37               ` Junio C Hamano
2013-08-08 20:04                 ` Ramkumar Ramachandra
2013-08-08 21:09                   ` Martin Langhoff
2013-08-09 11:00                   ` Jeff King
2013-08-09 13:34                     ` Ramkumar Ramachandra
2013-08-09 17:35                       ` Junio C Hamano
2013-08-09 22:16                       ` Jeff King
2013-08-10  1:24                         ` Duy Nguyen
2013-08-10  9:50                           ` Jeff King
2013-08-10  5:26                         ` Junio C Hamano
2013-08-10  8:42                         ` Ramkumar Ramachandra
2013-08-10  9:24                           ` Duy Nguyen
2013-08-10  9:28                             ` Duy Nguyen
2013-08-10  9:43                             ` Jeff King
2013-08-10  9:50                               ` Duy Nguyen
2013-08-10 10:05                                 ` Ramkumar Ramachandra
2013-08-10 10:16                                   ` Duy Nguyen
2013-08-10  9:38                           ` Jeff King
2013-08-07  0:10   ` Martin Fick
2013-08-08  2:18     ` Duy Nguyen
2013-08-07  0:25 ` Martin Fick
2013-08-07  4:36   ` Ramkumar Ramachandra
