From: Christian Couder <christian.couder@gmail.com>
To: ZheNing Hu <adlternative@gmail.com>
Cc: Git List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Hariom verma <hariom18599@gmail.com>, Jeff King <peff@peff.net>
Subject: Re: [GSoC] Git Blog 4
Date: Mon, 14 Jun 2021 10:02:19 +0200
Message-ID: <CAP8UFD0jiZuPvO-oYXw9PmVQ56tpYc9nxUxAjPQrc2f1qwEqUQ@mail.gmail.com>
In-Reply-To: <CAOLTT8QHL-6-DxoRKtx5cVp_DePxtWYU4CuBweYfCG1hGZZhaA@mail.gmail.com>

On Sun, Jun 13, 2021 at 4:17 PM ZheNing Hu <adlternative@gmail.com> wrote:

> In addition, the output of some commands like
> `printf "%b" "a\0b\0c" >blob1` is truncated at the first NUL on a
> 32-bit machine, while on 64-bit machines the NULs are stored in the
> file as expected. This made me think that Git's file decompression
> had a bug on 32-bit machines, until I used an Ubuntu32 docker
> container to clone the git repository and analyzed the problem in
> depth. In the end, I used `printf "a\0b\0c"` so that the output is
> not truncated at NUL on 32-bit machines. Is there a better way to
> write binary data to a file than `printf` and `echo`?

You might want to take a look at t/t4058-diff-duplicates.sh which has
the following:

# make_tree_entry <mode> <mode> <sha1>
#
# We have to rely on perl here because not all printfs understand
# hex escapes (only octal), and xxd is not portable.
make_tree_entry () {
       printf '%s %s\0' "$1" "$2" &&
       perl -e 'print chr(hex($_)) for ($ARGV[0] =~ /../g)' "$3"
}
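
For the simpler case of a few NUL bytes, here is a minimal sketch of
the approach you ended up with. It assumes only a POSIX printf and od,
and the helper names write_nul_blob and hex_dump are made up for this
example (they are not part of Git's test suite). The escapes go into
the printf format string as three-digit octal instead of being passed
through a %b argument, and od is used to check what actually landed in
the file:

```shell
# Hypothetical helper: write the five bytes 'a' NUL 'b' NUL 'c' to file "$1".
# The octal escapes live in the format string, not in a %b argument.
write_nul_blob () {
	printf 'a\000b\000c' >"$1"
}

# Hypothetical helper: print the bytes of file "$1" as one hex string.
hex_dump () {
	od -An -tx1 "$1" | tr -d ' \n'
}

# Assumed scratch location, only for this example.
blob="${TMPDIR:-/tmp}/nul-blob.$$"

write_nul_blob "$blob" &&
hex_dump "$blob"
```

If the hex dump shows both 00 bytes and the file is five bytes long,
nothing was truncated on that platform.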

> Since I am a newbie to docker, I would like to know if there is any
> way to run Git's GitHub CI remotely or locally?

There are scripts in the ci/ directory, but yeah it could help if
there was a README there.

> In the second half of this week, I tried to make `cat-file` reuse the
> logic of `ref-filter`. I have to say that this is a very difficult process:
> running "rebase -i" again and again to repair the content of previous
> commits, squashing commits, splitting commits, modifying commit
> messages... Finally, I
> submitted the patches to the Git mailing list in
> [[PATCH 0/8] [GSOC][RFC] cat-file: reuse `ref-filter`
> logic](https://lore.kernel.org/git/pull.980.git.1623496458.gitgitgadget@gmail.com/).
> Now `cat-file` has learned most of the atoms in `ref-filter`. I am very
> happy to be able to make git support richer functions through my own code.
>
> Regrettably, `git cat-file --batch --batch-all-objects` seems to take up
> a huge amount of memory on a large repo such as git.git, and it will
> be killed by the Linux OOM killer.

In the cover letter of your patch series you say:

"There is still an unresolved issue: performance overhead is very large, so
that when we use:

git cat-file --batch --batch-all-objects >/dev/null

on git.git, it may fail."

Is this the same issue? Is it only a memory issue, or is your patch
series also making things slower?

> This is mainly because we will make a large
> number of copies of the object's raw data. The original `git cat-file`
> uses `read_object_file()` or `stream_blob()` to output the object's
> raw data, but in `ref-filter` we have to use `v->s` to hold a copy of the
> object's data, and it is difficult to eliminate `v->s` and print the output
> directly to the final output buffer, because we may have atoms like
> `%(if)`, `%(else)` that need to use buffers on the stack to build the
> final output string layer by layer,

What does "layer by layer" mean here?

> or `cmp_ref_sorting()` needs to use `v->s` to
> compare two refs. In short, it is very difficult for `ref-filter` to reduce
> copy overhead. I even thought about using the string pool API
> `memintern()` to replace `xmemdupz()`, but the effect does not seem
> to be significant. A large number of objects' data will still reside in
> memory, so this may not be a good method.

Would it be possible to keep the data for a limited number of objects,
then print everything related to these objects, free their data and
start again with another limited number of objects?

> Anyway, stay confident. I can solve these difficult problems with
> the help of mentors and reviewers. `:)`

Sure :-)
