* git-fast-import
@ 2007-02-06  2:31 Shawn O. Pearce
  2007-02-06  3:18 ` git-fast-import Nicolas Pitre
                   ` (6 more replies)
  0 siblings, 7 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06  2:31 UTC (permalink / raw)
  To: git

I'm starting to get gfi (git-fast-import) prepared for a merge into
the main git.git tree.  For those who don't know, gfi is the result
of my work with Jon Smirl on trying to *quickly* import the massive
Mozilla CVS repository into Git.  Recently it's been getting a lot
of attention from the KDE, OOo, Dragonfly BSD, and Qt projects.

When exactly we merge it in will depend a lot on Junio.  It should
be safe to merge before 1.5.0 as it's strictly new source files,
but we may still want to wait until after 1.5.0 is out.

I'm mainly worried about breaking compilation on odd architectures.
gfi builds, runs and has been used for production level imports
on Mac OS X, Linux and Dragonfly BSD, using both 32 bit and 64 bit
architectures, but some of Git's other targets (e.g. AIX) haven't
seen any testing.

The gfi code is quite stable and has been getting a lot of use
(and discussion) lately.  A new test (t/t9300-fast-import.sh)
has been added and now, finally, documentation
(Documentation/git-fast-import.txt).

As gfi is 1962 lines of C and its development history consists of
74 commits made over the span of 7 months (first commit was Aug 5,
2006) and several versions of core Git code (which gfi calls into,
and which has gone through some non-trivial changes during that
time), I'm going to ask Junio to directly pull the topic branch
into git.git, rather than submitting it as patches.

My topic branch is published on repo.or.cz (thanks Pasky!).  I would
encourage all parties who would have otherwise been interested in
reviewing the patches on the mailing list to clone/fetch the topic
and review it locally instead.

	gitweb: http://repo.or.cz/w/git/fastimport.git
	clone:  git://repo.or.cz/git/fastimport.git

I'm particularly interested in feedback on the documentation,
so I am attaching it below.

-------
git-fast-import(1)
==================

NAME
----
git-fast-import - Backend for fast Git data importers.


SYNOPSIS
--------
frontend | 'git-fast-import' [options]

DESCRIPTION
-----------
This program is usually not what the end user wants to run directly.
Most end users want to use one of the existing frontend programs,
which parse a specific type of foreign source and feed the contents
stored there to git-fast-import (gfi).

gfi reads a mixed command/data stream from standard input and
writes one or more packfiles directly into the current repository.
When EOF is received on standard input, fast import writes out
updated branch and tag refs, fully updating the current repository
with the newly imported data.

The gfi backend itself can import into an empty repository (one that
has already been initialized by gitlink:git-init[1]) or incrementally
update an existing populated repository.  Whether or not incremental
imports are supported from a particular foreign source depends on
the frontend program in use.


OPTIONS
-------
--max-pack-size=<n>::
	Maximum size of each output packfile, expressed in MiB.
	The default is 4096 (4 GiB) as that is the maximum allowed
	packfile size (due to file format limitations). Some
	importers may wish to lower this, such as to ensure the
	resulting packfiles fit on CDs.

--depth=<n>::
	Maximum delta depth, for blob and tree deltification.
	Default is 10.

--active-branches=<n>::
	Maximum number of branches to maintain active at once.
	See ``Memory Utilization'' below for details.  Default is 5.

--export-marks=<file>::
	Dumps the internal marks table to <file> when complete.
	Marks are written one per line as `:markid SHA-1`.
	Frontends can use this file to validate imports after they
	have been completed.

--branch-log=<file>::
	Records every tag and commit made to a log file.  (This file
	can be quite verbose on large imports.)  This particular
	option has been primarily intended to facilitate debugging
	gfi and has limited usefulness in other contexts.  It may
	be removed in future versions.
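
A frontend that wants to validate an import could read the file
written by `--export-marks` back in.  The following is only a minimal
sketch: the function name `parse_marks` is hypothetical, and it assumes
each SHA-1 is written as 40 lowercase hex digits, which the text above
does not state explicitly.

```python
import re

# One mark per line, in the `:markid SHA-1` form described above.
# Assumption: the SHA-1 is printed as 40 lowercase hex digits.
MARK_LINE = re.compile(r"^:(\d+) ([0-9a-f]{40})$")


def parse_marks(text):
    """Return {mark id: SHA-1 hex string} for the text of a marks file."""
    marks = {}
    for line in text.splitlines():
        if not line:
            continue  # tolerate a blank line at the end of the file
        match = MARK_LINE.match(line)
        if match is None:
            raise ValueError("unexpected marks line: %r" % line)
        marks[int(match.group(1))] = match.group(2)
    return marks
```

The resulting dictionary can then be compared against whatever the
frontend recorded for each mark while it was generating the stream.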


Performance
-----------
The design of gfi allows it to import large projects with minimal
memory usage and processing time.  Assuming the frontend
is able to keep up with gfi and feed it a constant stream of data,
import times for projects holding 10+ years of history and containing
100,000+ individual commits are generally completed in just 1-2
hours on quite modest (~$2,000 USD) hardware.

Most bottlenecks appear to be in foreign source data access (the
source just cannot extract revisions fast enough) or disk IO (gfi
writes as fast as the disk will take the data).  Imports will run
faster if the source data is stored on a different drive than the
destination Git repository (due to less IO contention).


Development Cost
----------------
A typical frontend for gfi tends to weigh in at approximately 200
lines of Perl/Python/Ruby code.  Most developers have been able to
create working importers in just a couple of hours, even though it
is their first exposure to gfi, and sometimes even to Git.  This is
an ideal situation, given that most conversion tools are throw-away
(use once, and never look back).


Parallel Operation
------------------
Like `git-push` or `git-fetch`, imports handled by gfi are safe to
run alongside parallel `git repack -a -d` or `git gc` invocations,
or any other Git operation (including `git prune`, as loose objects
are never used by gfi).

However, gfi does not lock the branch or tag refs it is actively
importing.  After EOF, during its ref update phase, gfi blindly
overwrites each imported branch or tag ref.  Consequently it is not
safe to modify refs that are currently being used by a running gfi
instance, as work could be lost when gfi overwrites the refs.


Technical Discussion
--------------------
gfi tracks a set of branches in memory.  Any branch can be created
or modified at any point during the import process by sending a
`commit` command on the input stream.  This design allows a frontend
program to process an unlimited number of branches simultaneously,
generating commits in the order they are available from the source
data.  It also simplifies the frontend programs considerably.

gfi does not use or alter the current working directory, or any
file within it.  (It does however update the current Git repository,
as referenced by `GIT_DIR`.)  Therefore an import frontend may use
the working directory for its own purposes, such as extracting file
revisions from the foreign source.  This ignorance of the working
directory also allows gfi to run very quickly, as it does not
need to perform any costly file update operations when switching
between branches.

Input Format
------------
With the exception of raw file data (which Git does not interpret)
the gfi input format is text (ASCII) based.  This text based
format simplifies development and debugging of frontend programs,
especially when a higher level language such as Perl, Python or
Ruby is being used.

gfi is very strict about its input.  Where we say SP below we mean
*exactly* one space.  Likewise LF means one (and only one) linefeed.
Supplying additional whitespace characters will cause unexpected
results, such as branch names or file names with leading or trailing
spaces in their name, or early termination of gfi when it encounters
unexpected input.

Commands
~~~~~~~~
gfi accepts several commands to update the current repository
and control the current import process.  More detailed discussion
(with examples) of each command follows later.

`commit`::
	Creates a new branch or updates an existing branch by
	creating a new commit and updating the branch to point at
	the newly created commit.

`tag`::
	Creates an annotated tag object from an existing commit or
	branch.  Lightweight tags are not supported by this command,
	as they are not recommended for recording meaningful points
	in time.

`reset`::
	Reset an existing branch (or a new branch) to a specific
	revision.  This command must be used to change a branch to
	a specific revision without making a commit on it.

`blob`::
	Convert raw file data into a blob, for future use in a
	`commit` command.  This command is optional and is not
	needed to perform an import.

`checkpoint`::
	Forces gfi to close the current packfile, generate its
	unique SHA-1 checksum and index, and start a new packfile.
	This command is optional and is not needed to perform
	an import.

`commit`
~~~~~~~~
Create or update a branch with a new commit, recording one logical
change to the project.

....
	'commit' SP <ref> LF
	mark?
	('author' SP <name> SP LT <email> GT SP <time> SP <tz> LF)?
	'committer' SP <name> SP LT <email> GT SP <time> SP <tz> LF
	data
	('from' SP <committish> LF)?
	('merge' SP <committish> LF)?
	(filemodify | filedelete)*
	LF
....

where `<ref>` is the name of the branch to make the commit on.
Typically branch names are prefixed with `refs/heads/` in
Git, so importing the CVS branch symbol `RELENG-1_0` would use
`refs/heads/RELENG-1_0` for the value of `<ref>`.  The value of
`<ref>` must be a valid refname in Git.  As `LF` is not valid in
a Git refname, no quoting or escaping syntax is supported here.

A `mark` command may optionally appear, requesting gfi to save a
reference to the newly created commit for future use by the frontend
(see below for format).  It is very common for frontends to mark
every commit they create, thereby allowing future branch creation
from any imported commit.

The `data` command following `committer` must supply the commit
message (see below for `data` command syntax).  To import an empty
commit message use a 0 length data.  Commit messages are free-form
and are not interpreted by Git.  Currently they must be encoded in
UTF-8, as gfi does not permit other encodings to be specified.

Zero or more `filemodify` and `filedelete` commands may be
included to update the contents of the branch prior to the commit.
These commands can be supplied in any order, as gfi is not sensitive
to pathname or operation ordering.

`author`
^^^^^^^^
An `author` command may optionally appear, if the author information
might differ from the committer information.  If `author` is omitted
then gfi will automatically use the committer's information for
the author portion of the commit.  See below for a description of
the fields in `author`, as they are identical to `committer`.

`committer`
^^^^^^^^^^^
The `committer` command indicates who made this commit, and when
they made it.

Here `<name>` is the person's display name (for example
``Com M Itter'') and `<email>` is the person's email address
(``cm@example.com'').  `LT` and `GT` are the literal less-than (\x3c)
and greater-than (\x3e) symbols.  These are required to delimit
the email address from the other fields in the line.  Note that
`<name>` is free-form and may contain any sequence of bytes, except
`LT` and `LF`.  It is typically UTF-8 encoded.

The time of the change is specified by `<time>` as the number of
seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
written in base-10 notation using US-ASCII digits.  The committer's
timezone is specified by `<tz>` as a positive or negative offset
from UTC, in minutes.  For example EST would be expressed in `<tz>`
by ``-0500''.
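
As a rough illustration, a Python frontend might build the `committer`
line as follows.  The helper names are hypothetical, and the sketch
assumes `<tz>` takes the shape of the ``-0500'' example: a sign followed
by two digits of hours and two digits of minutes.

```python
def format_tz(offset_minutes):
    """Render a UTC offset given in minutes as sign + HHMM, e.g. -300 -> "-0500"."""
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return "%s%02d%02d" % (sign, hours, minutes)


def committer_line(name, email, epoch_seconds, offset_minutes):
    """Build one `committer` line, terminated by exactly one LF."""
    # <name> is free-form but may not contain LT or LF.
    if "<" in name or "\n" in name:
        raise ValueError("name must not contain '<' or a linefeed")
    return "committer %s <%s> %d %s\n" % (
        name, email, epoch_seconds, format_tz(offset_minutes))
```

For example, `committer_line("Com M Itter", "cm@example.com", 1170729060, -300)`
yields a line ending in `1170729060 -0500`.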

`from`
^^^^^^
Only valid for the first commit made on this branch by this
gfi process.  The `from` command is used to specify the commit
to initialize this branch from.  This revision will be the first
ancestor of the new commit.

Omitting the `from` command in the first commit of a new branch will
cause gfi to create that commit with no ancestor. This tends to be
desired only for the initial commit of a project.  Omitting the
`from` command on existing branches is required, as the current
commit on that branch is automatically assumed to be the first
ancestor of the new commit.

As `LF` is not valid in a Git refname or SHA-1 expression, no
quoting or escaping syntax is supported within `<committish>`.

Here `<committish>` is any of the following:

* The name of an existing branch already in gfi's internal branch
  table.  If gfi doesn't know the name, it's treated as a SHA-1
  expression.

* A mark reference, `:<idnum>`, where `<idnum>` is the mark number.
+
The reason gfi uses `:` to denote a mark reference is this character
is not legal in a Git branch name.  The leading `:` makes it easy
to distinguish between the mark 42 (`:42`) and the branch 42 (`42`
or `refs/heads/42`), or an abbreviated SHA-1 which happened to
consist only of base-10 digits.
+
Marks must be declared (via `mark`) before they can be used.

* A complete 40 byte or abbreviated commit SHA-1 in hex.

* Any valid Git SHA-1 expression that resolves to a commit.  See
  ``SPECIFYING REVISIONS'' in gitlink:git-rev-parse[1] for details.

The special case of restarting an incremental import from the
current branch value should be written as:
----
	from refs/heads/branch^0
----
The `^0` suffix is necessary as gfi does not permit a branch to
start from itself, and the branch is created in memory before the
`from` command is even read from the input.  Adding `^0` will force
gfi to resolve the commit through Git's revision parsing library,
rather than its internal branch table, thereby loading in the
existing value of the branch.

`merge`
^^^^^^^
Includes one additional ancestor commit, and makes the current
commit a merge commit.  An unlimited number of `merge` commands per
commit are permitted by gfi, thereby establishing an n-way merge.
However Git's other tools never create commits with more than 15
additional ancestors (forming a 16-way merge).  For this reason
it is suggested that frontends do not use more than 15 `merge`
commands per commit.

Here `<committish>` is any of the commit specification expressions
also accepted by `from` (see above).

`filemodify`
^^^^^^^^^^^^
Included in a `commit` command to add a new file or change the
content of an existing file.  This command has two different means
of specifying the content of the file.

External data format::
	The data content for the file was already supplied by a prior
	`blob` command.  The frontend just needs to connect it.
+
....
	'M' SP <mode> SP <dataref> SP <path> LF
....
+
Here `<dataref>` can be either a mark reference (`:<idnum>`)
set by a prior `blob` command, or a full 40-byte SHA-1 of an
existing Git blob object.

Inline data format::
	The data content for the file has not been supplied yet.
	The frontend wants to supply it as part of this modify
	command.
+
....
	'M' SP <mode> SP 'inline' SP <path> LF
	data
....
+
See below for a detailed description of the `data` command.

In both formats `<mode>` is the type of file entry, specified
in octal.  Git only supports the following modes:

* `100644` or `644`: A normal (not-executable) file.  The majority
  of files in most projects use this mode.  If in doubt, this is
  what you want.
* `100755` or `755`: A normal, but executable, file.
* `120000`: A symlink, the content of the file will be the link target.

In both formats `<path>` is the complete path of the file to be added
(if not already existing) or modified (if already existing).

A `<path>` string must use UNIX-style directory separators (forward
slash `/`), may contain any byte other than `LF`, and must not
start with double quote (`"`).

If an `LF` or double quote must be encoded into `<path>` shell-style
quoting should be used, e.g. `"path/with\n and \" in it"`.

The value of `<path>` must be in canonical form. That is it must not:

* contain an empty directory component (e.g. `foo//bar` is invalid),
* end with a directory separator (e.g. `foo/` is invalid),
* start with a directory separator (e.g. `/foo` is invalid),
* contain the special component `.` or `..` (e.g. `foo/./bar` and
  `foo/../bar` are invalid).
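
The rules above can be checked by a frontend before it writes a path
to the stream.  This is a small sketch only; `is_canonical_path` is a
hypothetical helper, and rejecting the empty string is an assumption
that the rules above do not spell out.

```python
def is_canonical_path(path):
    """Return True if `path` satisfies the canonical-form rules listed above."""
    if not path:
        return False  # assumption: an empty path is never meaningful
    # An empty component covers `foo//bar`, a leading `/` and a trailing `/`.
    return all(part not in ("", ".", "..") for part in path.split("/"))
```

A frontend could call this on every path and abort early, rather than
letting gfi terminate partway through a long import.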

It is recommended that `<path>` always be encoded using UTF-8.


`filedelete`
^^^^^^^^^^^^
Included in a `commit` command to remove a file from the branch.
If the file removal makes its directory empty, the directory will
be automatically removed too.  This cascades up the tree until the
first non-empty directory or the root is reached.

....
	'D' SP <path> LF
....

where `<path>` is the complete path of the file to be removed.
See `filemodify` above for a detailed description of `<path>`.

`mark`
~~~~~~
Arranges for gfi to save a reference to the current object, allowing
the frontend to recall this object at a future point in time, without
knowing its SHA-1.  Here the current object is the object creation
command the `mark` command appears within.  This can be `commit`,
`tag`, and `blob`, but `commit` is the most common usage.

....
	'mark' SP ':' <idnum> LF
....

where `<idnum>` is the number assigned by the frontend to this mark.
The value of `<idnum>` is expressed in base 10 notation using
US-ASCII digits.  The value 0 is reserved and cannot be used as
a mark.  Only values greater than or equal to 1 may be used as marks.

New marks are created automatically.  Existing marks can be moved
to another object simply by reusing the same `<idnum>` in another
`mark` command.

`tag`
~~~~~
Creates an annotated tag referring to a specific commit.  To create
lightweight (non-annotated) tags see the `reset` command below.

....
	'tag' SP <name> LF
	'from' SP <committish> LF
	'tagger' SP <name> SP LT <email> GT SP <time> SP <tz> LF
	data
	LF
....

where `<name>` is the name of the tag to create.

Tag names are automatically prefixed with `refs/tags/` when stored
in Git, so importing the CVS branch symbol `RELENG-1_0-FINAL` would
use just `RELENG-1_0-FINAL` for `<name>`, and gfi will write the
corresponding ref as `refs/tags/RELENG-1_0-FINAL`.

The value of `<name>` must be a valid refname in Git and therefore
may contain forward slashes.  As `LF` is not valid in a Git refname,
no quoting or escaping syntax is supported here.

The `from` command is the same as in the `commit` command; see
above for details.

The `tagger` command uses the same format as `committer` within
`commit`; again see above for details.

The `data` command following `tagger` must supply the annotated tag
message (see below for `data` command syntax).  To import an empty
tag message use a 0 length data.  Tag messages are free-form and are
not interpreted by Git.  Currently they must be encoded in UTF-8,
as gfi does not permit other encodings to be specified.

Signing annotated tags during import from within gfi is not
supported.  Trying to include your own PGP/GPG signature is not
recommended, as the frontend does not (easily) have access to the
complete set of bytes which normally goes into such a signature.
If signing is required, create lightweight tags from within gfi with
`reset`, then create the annotated versions of those tags offline
with the standard gitlink:git-tag[1] process.

`reset`
~~~~~~~
Creates (or recreates) the named branch, optionally starting from
a specific revision.  The reset command allows a frontend to issue
a new `from` command for an existing branch, or to create a new
branch from an existing commit without creating a new commit.

....
	'reset' SP <ref> LF
	('from' SP <committish> LF)?
	LF
....

For a detailed description of `<ref>` and `<committish>` see above
under `commit` and `from`.

The `reset` command can also be used to create lightweight
(non-annotated) tags.  For example:

====
	reset refs/tags/938
	from :938
====

would create the lightweight tag `refs/tags/938` referring to
whatever commit mark `:938` references.
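
A frontend could generate such a `reset` with a helper along these
lines.  The function name is hypothetical; the trailing blank line
follows the `reset` syntax shown above.

```python
def lightweight_tag(name, mark):
    """Build a `reset` command that points refs/tags/<name> at mark :<mark>."""
    if "\n" in name:
        raise ValueError("LF is not valid in a refname")
    # 'reset' SP <ref> LF, then 'from' SP <committish> LF, then a closing LF.
    return "reset refs/tags/%s\nfrom :%d\n\n" % (name, mark)
```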

`blob`
~~~~~~
Requests writing one file revision to the packfile.  The revision
is not connected to any commit; this connection must be formed in
a subsequent `commit` command by referencing the blob through an
assigned mark.

....
	'blob' LF
	mark?
	data
....

The mark command is optional here as some frontends have chosen
to generate the Git SHA-1 for the blob on their own, and feed that
directly to `commit`.  This is typically more work than it's worth
however, as marks are inexpensive to store and easy to use.
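
To show how a marked blob is later connected to a commit, here is a
minimal sketch of a frontend emitting one `blob` followed by one
`commit` that references it.  All function names are hypothetical, the
mode is fixed at `100644` for brevity, and paths are assumed not to
need quoting.

```python
def data_command(raw):
    """Exact byte count form of the `data` command."""
    return b"data %d\n" % len(raw) + raw + b"\n"


def blob_command(mark, content):
    """A `blob` command carrying a mark so a later commit can refer to it."""
    return b"blob\nmark :%d\n" % mark + data_command(content)


def commit_command(ref, mark, committer, message, files):
    """A `commit` with no `from` or `merge`; `files` maps path -> blob mark."""
    out = [
        b"commit " + ref.encode("utf-8") + b"\n",
        b"mark :%d\n" % mark,
        b"committer " + committer.encode("utf-8") + b"\n",
        data_command(message.encode("utf-8")),
    ]
    for path, blob_mark in sorted(files.items()):
        # External data format: 'M' SP <mode> SP <dataref> SP <path> LF
        out.append(b"M 100644 :%d %s\n" % (blob_mark, path.encode("utf-8")))
    out.append(b"\n")  # the LF that ends the commit command
    return b"".join(out)
```

Concatenating `blob_command(1, ...)` with a `commit_command(...)` whose
`files` argument is `{"README": 1}` gives a stream that could be piped
to gfi on standard input.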

`data`
~~~~~~
Supplies raw data (for use as blob/file content, commit messages, or
annotated tag messages) to gfi.  Data can be supplied using an exact
byte count or delimited with a terminating line.  Real frontends
intended for production-quality conversions should always use the
exact byte count format, as it is more robust and performs better.
The delimited format is intended primarily for testing gfi.

Exact byte count format:

....
	'data' SP <count> LF
	<raw> LF
....

where `<count>` is the exact number of bytes appearing within
`<raw>`.  The value of `<count>` is expressed in base 10 notation
using US-ASCII digits.  The `LF` on either side of `<raw>` is not
included in `<count>` and will not be included in the imported data.
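
In Python the exact byte count format might be produced as below.
This is a sketch with a hypothetical function name; the point it
illustrates is that `<count>` is a count of bytes, so text has to be
encoded before its length is taken.

```python
def data_command(raw):
    """Build a `data` command in exact byte count format from a bytes object."""
    if not isinstance(raw, bytes):
        raise TypeError("encode text to bytes before counting its length")
    # 'data' SP <count> LF <raw> LF; the final LF is not part of <count>.
    return b"data %d\n" % len(raw) + raw + b"\n"
```

An empty commit or tag message is simply `data_command(b"")`, which
produces a 0 length data.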

Delimited format:

....
	'data' SP '<<' <delim> LF
	<raw> LF
	<delim> LF
....

where `<delim>` is the chosen delimiter string.  The string `<delim>`
must not appear on a line by itself within `<raw>`, as otherwise
gfi will think the data ends earlier than it really does.  The `LF`
immediately trailing `<raw>` is part of `<raw>`.  This is one of
the limitations of the delimited format: it is impossible to supply
a data chunk which does not have an LF as its last byte.

`checkpoint`
~~~~~~~~~~~~
Forces gfi to close the current packfile and start a new one.
As this requires a significant amount of CPU time and disk IO
(to compute the overall pack SHA-1 checksum and generate the
corresponding index file) it can easily take several minutes for
a single `checkpoint` command to complete.

....
	'checkpoint' LF
	LF
....

Packfile Optimization
---------------------
When packing a blob gfi always attempts to deltify against the last
blob written.  Unless specifically arranged for by the frontend,
this will probably not be a prior version of the same file, so the
generated delta will not be the smallest possible.  The resulting
packfile will be compressed, but will not be optimal.

Frontends which have efficient access to all revisions of a
single file (for example reading an RCS/CVS ,v file) can choose
to supply all revisions of that file as a sequence of consecutive
`blob` commands.  This allows gfi to deltify the different file
revisions against each other, saving space in the final packfile.
Marks can be used to later identify individual file revisions during
a sequence of `commit` commands.
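
One way a frontend might arrange this is sketched below.  The function
name and the `(path, revision, content)` tuples are hypothetical; the
sketch simply groups revisions by path so that all revisions of one
file are emitted back to back, and hands out marks from 1 to n.

```python
def blob_stream(revisions):
    """Emit all revisions of each file as consecutive `blob` commands.

    `revisions` is an iterable of (path, revision_id, content_bytes).
    Returns (stream_bytes, marks) where marks maps (path, revision_id)
    to the mark number assigned to that blob.
    """
    # sorted() is stable, so revisions of one file keep their input order.
    ordered = sorted(revisions, key=lambda rev: rev[0])
    marks = {}
    chunks = []
    for mark, (path, revision_id, content) in enumerate(ordered, start=1):
        marks[(path, revision_id)] = mark
        chunks.append(
            b"blob\nmark :%d\ndata %d\n" % (mark, len(content))
            + content + b"\n")
    return b"".join(chunks), marks
```

The returned `marks` table is what later `commit` commands would
consult to find the right `:<idnum>` for each file revision.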

The packfile(s) created by gfi do not encourage good disk access
patterns.  This is caused by gfi writing the data in the order
it is received on standard input, while Git typically organizes
data within packfiles to make the most recent (current tip) data
appear before historical data.  Git also clusters commits together,
speeding up revision traversal through better cache locality.

For this reason it is strongly recommended that users repack the
repository with `git repack -a -d` after gfi completes, allowing
Git to reorganize the packfiles for faster data access.  If blob
deltas are suboptimal (see above) then also adding the `-f` option
to force recomputation of all deltas can significantly reduce the
final packfile size (30-50% smaller can be quite typical).

Memory Utilization
------------------
There are a number of factors which affect how much memory gfi
requires to perform an import.  Like critical sections of core
Git, gfi uses its own memory allocators to amortize any overheads
associated with malloc.  In practice gfi tends to amortize any
malloc overheads to 0, due to its use of large block allocations.

per object
~~~~~~~~~~
gfi maintains an in-memory structure for every object written in
this execution.  On a 32 bit system the structure is 32 bytes,
on a 64 bit system the structure is 40 bytes (due to the larger
pointer sizes).  Objects in the table are not deallocated until
gfi terminates.  Importing 2 million objects on a 32 bit system
will require approximately 64 MiB of memory.
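
The estimate is plain multiplication, as this small sketch shows.  The
function name is hypothetical; the per-object sizes are the ones given
above.

```python
# Bytes per object table entry, keyed by pointer width, as stated above.
ENTRY_BYTES = {32: 32, 64: 40}


def object_table_bytes(object_count, pointer_bits=32):
    """Estimate the memory used by the object table for one import."""
    return object_count * ENTRY_BYTES[pointer_bits]


# 2 million objects on a 32 bit system: 64,000,000 bytes (about 61 MiB).
print(object_table_bytes(2000000) / 2.0 ** 20)
```

The same import on a 64 bit system would need 80,000,000 bytes for
the table.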

The object table is actually a hashtable keyed on the object name
(the unique SHA-1).  This storage configuration allows gfi to reuse
an existing or already written object and avoid writing duplicates
to the output packfile.  Duplicate blobs are surprisingly common
in an import, typically due to branch merges in the source.

per mark
~~~~~~~~
Marks are stored in a sparse array, using 1 pointer (4 bytes or 8
bytes, depending on pointer size) per mark.  Although the array
is sparse, frontends are still strongly encouraged to use marks
between 1 and n, where n is the total number of marks required for
this import.

per branch
~~~~~~~~~~
Branches are classified as active and inactive.  The memory usage
of the two classes is significantly different.

Inactive branches are stored in a structure which uses 96 or 120
bytes (32 bit or 64 bit systems, respectively), plus the length of
the branch name (typically under 200 bytes), per branch.  gfi will
easily handle as many as 10,000 inactive branches in under 2 MiB
of memory.

Active branches have the same overhead as inactive branches, but
also contain copies of every tree that has been recently modified on
that branch.  If subtree `include` has not been modified since the
branch became active, its contents will not be loaded into memory,
but if subtree `src` has been modified by a commit since the branch
became active, then its contents will be loaded in memory.

As active branches store metadata about the files contained on that
branch, their in-memory storage size can grow to a considerable size
(see below).

gfi automatically moves active branches to inactive status based on
a simple least-recently-used algorithm.  The LRU chain is updated on
each `commit` command.  The maximum number of active branches can be
increased or decreased on the command line with `--active-branches=`.
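
The least-recently-used policy can be pictured with the following
Python sketch.  It is an illustration of the idea only, not gfi's
actual C implementation; the class and method names are hypothetical.

```python
from collections import OrderedDict


class ActiveBranches:
    """Track which branches are active, least recently used first."""

    def __init__(self, max_active=5):
        self.max_active = max_active
        self._active = OrderedDict()

    def touch(self, name):
        """Record a commit on `name`; return the branch made inactive, if any."""
        if name in self._active:
            self._active.move_to_end(name)  # now the most recently used
            return None
        self._active[name] = True
        if len(self._active) > self.max_active:
            evicted, _ = self._active.popitem(last=False)  # oldest entry
            return evicted
        return None

    def active(self):
        """Active branch names, least recently used first."""
        return list(self._active)
```

With `max_active=2`, committing to `a`, `b`, `a` and then `c` would
make `b` inactive, since `a` was used more recently.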

per active tree
~~~~~~~~~~~~~~~
Trees (aka directories) use just 12 bytes of memory on top of the
memory required for their entries (see ``per active file'' below).
The cost of a tree is virtually 0, as its overhead amortizes out
over the individual file entries.

per active file entry
~~~~~~~~~~~~~~~~~~~~~
Files (and pointers to subtrees) within active trees require 52 or 64
bytes (32/64 bit platforms) per entry.  To conserve space, file and
tree names are pooled in a common string table, allowing the filename
``Makefile'' to use just 16 bytes (after including the string header
overhead) no matter how many times it occurs within the project.

The active branch LRU, when coupled with the filename string pool
and lazy loading of subtrees, allows gfi to efficiently import
projects with 2,000+ branches and 45,114+ files in a very limited
memory footprint (less than 2.7 MiB per active branch).


Author
------
Written by Shawn O. Pearce <spearce@spearce.org>.

Documentation
--------------
Documentation by Shawn O. Pearce <spearce@spearce.org>.

GIT
---
Part of the gitlink:git[7] suite


* Re: git-fast-import
  2007-02-06  2:31 git-fast-import Shawn O. Pearce
@ 2007-02-06  3:18 ` Nicolas Pitre
  2007-02-06  4:06 ` git-fast-import Nicolas Pitre
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 52+ messages in thread
From: Nicolas Pitre @ 2007-02-06  3:18 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

On Mon, 5 Feb 2007, Shawn O. Pearce wrote:

> When exactly we merge it in will depend a lot on Junio.  It should
> be safe to merge before 1.5.0 as its strictly new source files,
> but we may still want to wait until after 1.5.0 is out.

For that reason I think it should go in now.

> I'm mainly worried about breaking compliation on odd architectures.

Well, if it doesn't build then just don't make it a fatal build error.  
That won't be worse than not having it included at all.
And if it compiles then consider it as a bonus!


Nicolas


* Re: git-fast-import
  2007-02-06  2:31 git-fast-import Shawn O. Pearce
  2007-02-06  3:18 ` git-fast-import Nicolas Pitre
@ 2007-02-06  4:06 ` Nicolas Pitre
  2007-02-06  5:48   ` git-fast-import Shawn O. Pearce
  2007-02-06  6:12 ` git-fast-import Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 52+ messages in thread
From: Nicolas Pitre @ 2007-02-06  4:06 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

On Mon, 5 Feb 2007, Shawn O. Pearce wrote:

> I'm particularly interested in feedback on the documentation,
> so I am attaching it below.
> 
[...]
> 
> The time of the change is specified by `<time>` as the number of
> seconds since the UNIX epoc (midnight, Jan 1, 1970, UTC) and is
> written in base-10 notation using US-ASCII digits.  The committer's
> timezone is specified by `<tz>` as a positive or negative offset
> from UTC, in minutes.  For example EST would be expressed in `<tz>`
> by ``-0500''.

I think this is quite error prone, demonstrated by the fact that we 
screwed that up ourselves on a few occasions.  I think that the frontend 
should be relieved from this by letting it provide the time of change in 
a more natural format amongst all possible ones (like RFC2822 for 
example) and gfi should simply give it to parse_date().

Otherwise I think this is pretty nice.


Nicolas


* Re: git-fast-import
  2007-02-06  4:06 ` git-fast-import Nicolas Pitre
@ 2007-02-06  5:48   ` Shawn O. Pearce
  2007-02-06 16:35     ` git-fast-import Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06  5:48 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 5 Feb 2007, Shawn O. Pearce wrote:
> > The time of the change is specified by `<time>` as the number of
> > seconds since the UNIX epoc (midnight, Jan 1, 1970, UTC) and is
> > written in base-10 notation using US-ASCII digits.  The committer's
> > timezone is specified by `<tz>` as a positive or negative offset
> > from UTC, in minutes.  For example EST would be expressed in `<tz>`
> > by ``-0500''.
> 
> I think this is quite error prone, demonstrated by the fact that we 
> screwed that up ourselves on a few occasions.  I think that the frontend 
> should be relieved from this by letting it provide the time of change in 
> a more natural format amongst all possible ones(like RFC2822 for 
> example) and gfi should simply give it to parse_date().

This is a really good point.  It's a little bit of work to switch
to parse_date(); I'll try to get it done tomorrow night.

-- 
Shawn.


* Re: git-fast-import
  2007-02-06  2:31 git-fast-import Shawn O. Pearce
  2007-02-06  3:18 ` git-fast-import Nicolas Pitre
  2007-02-06  4:06 ` git-fast-import Nicolas Pitre
@ 2007-02-06  6:12 ` Aneesh Kumar K.V
  2007-02-06  6:18   ` git-fast-import Shawn O. Pearce
  2007-02-06  9:28 ` git-fast-import Andy Parkins
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 52+ messages in thread
From: Aneesh Kumar K.V @ 2007-02-06  6:12 UTC (permalink / raw)
  To: git

Shawn O. Pearce wrote:
> I'm starting to get gfi (git-fast-import) prepared for a merge into
> the main git.git tree.  For those who don't know, gfi is the result
> of my work with Jon Smirl on trying to *quickly* import the massive
> Mozilla CVS repository into Git.  Recently its been getting a lot
> of attention from the KDE, OOo, Dragonfly BSD, and Qt projects.
> 
> When exactly we merge it in will depend a lot on Junio.  It should
> be safe to merge before 1.5.0 as its strictly new source files,
> but we may still want to wait until after 1.5.0 is out.
> 
> 
.....


> SYNOPSIS
> --------
> frontend | 'git-fast-import' [options]
> 

Do we have an example frontend that can be added along with gfi?

-aneesh


* Re: git-fast-import
  2007-02-06  6:12 ` git-fast-import Aneesh Kumar K.V
@ 2007-02-06  6:18   ` Shawn O. Pearce
  2007-02-07  4:55     ` git-fast-import Daniel Barkalow
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06  6:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: git

"Aneesh Kumar K.V" <aneesh.kumar@gmail.com> wrote:
> >SYNOPSIS
> >--------
> >frontend | 'git-fast-import' [options]
> >
> 
> Do we have an example frontend that can be added along with gfi?

Not yet.  Some frontends are available here on repo.or.cz:

  gitweb: http://repo.or.cz/w/fast-export.git
  clone:  git://repo.or.cz/fast-export.git

But both lack branch support, for example, so they probably aren't
nearly as complete as the existing non-gfi based importers.

-- 
Shawn.


* Re: git-fast-import
  2007-02-06  2:31 git-fast-import Shawn O. Pearce
                   ` (2 preceding siblings ...)
  2007-02-06  6:12 ` git-fast-import Aneesh Kumar K.V
@ 2007-02-06  9:28 ` Andy Parkins
  2007-02-06  9:40   ` git-fast-import Shawn O. Pearce
  2007-02-06 16:37   ` git-fast-import Linus Torvalds
  2007-02-06  9:34 ` git-fast-import Jakub Narebski
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 52+ messages in thread
From: Andy Parkins @ 2007-02-06  9:28 UTC (permalink / raw)
  To: git; +Cc: Shawn O. Pearce

On Tuesday 2007 February 06 02:31, Shawn O. Pearce wrote:

> The time of the change is specified by `<time>` as the number of
> seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
> written in base-10 notation using US-ASCII digits.  The committer's
> timezone is specified by `<tz>` as a positive or negative offset
> from UTC, in minutes.  For example EST would be expressed in `<tz>`
> by ``-0500''.

Is <tz> /really/ expressed in minutes?  500 minutes is 8 hours 20 minutes.

I know what you mean, of course; and so would anyone reading it - so I suggest 
just dropping the ", in minutes" - as it's not true.


Andy
-- 
Dr Andy Parkins, M Eng (hons), MIEE
andyparkins@gmail.com


* Re: git-fast-import
  2007-02-06  2:31 git-fast-import Shawn O. Pearce
                   ` (3 preceding siblings ...)
  2007-02-06  9:28 ` git-fast-import Andy Parkins
@ 2007-02-06  9:34 ` Jakub Narebski
  2007-02-06  9:39   ` git-fast-import Shawn O. Pearce
  2007-02-06  9:53 ` git-fast-import Jakub Narebski
  2007-02-06 13:50 ` git-fast-import Alex Riesen
  6 siblings, 1 reply; 52+ messages in thread
From: Jakub Narebski @ 2007-02-06  9:34 UTC (permalink / raw)
  To: git

Shawn O. Pearce wrote:

> --depth=<n>::
>         Maximum delta depth, for blob and tree deltification.
>         Default is 10.

Does it support --window=<n> option?
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: git-fast-import
  2007-02-06  9:34 ` git-fast-import Jakub Narebski
@ 2007-02-06  9:39   ` Shawn O. Pearce
  0 siblings, 0 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06  9:39 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> wrote:
> Shawn O. Pearce wrote:
> 
> > --depth=<n>::
> >         Maximum delta depth, for blob and tree deltification.
> >         Default is 10.
> 
> Does it support --window=<n> option?

No.  It probably never will.  I don't see a reason to add it.
Heck, --depth is probably not a great knob to have either.  At
least not in gfi.

-- 
Shawn.


* Re: git-fast-import
  2007-02-06  9:28 ` git-fast-import Andy Parkins
@ 2007-02-06  9:40   ` Shawn O. Pearce
  2007-02-06 16:37   ` git-fast-import Linus Torvalds
  1 sibling, 0 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06  9:40 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git

Andy Parkins <andyparkins@gmail.com> wrote:
> On Tuesday 2007 February 06 02:31, Shawn O. Pearce wrote:
> 
> > The time of the change is specified by `<time>` as the number of
> > seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
> > written in base-10 notation using US-ASCII digits.  The committer's
> > timezone is specified by `<tz>` as a positive or negative offset
> > from UTC, in minutes.  For example EST would be expressed in `<tz>`
> > by ``-0500''.
> 
> Is <tz> /really/ expressed in minutes?  500 minutes is 8 hours 20 minutes.
> 
> I know what you mean, of course; and so would anyone reading it - so I suggest 
> just dropping the ", in minutes" - as it's not true.

Heh, right you are!

Nico's point about using parse_date() here is a really good one.
I'm going to modify that section of gfi to use parse_date(), which
would change the language here anyway.  I'll try not to make a
silly mistake such as the above in the updated docs.  :)

-- 
Shawn.


* Re: git-fast-import
  2007-02-06  2:31 git-fast-import Shawn O. Pearce
                   ` (4 preceding siblings ...)
  2007-02-06  9:34 ` git-fast-import Jakub Narebski
@ 2007-02-06  9:53 ` Jakub Narebski
  2007-02-06 17:20   ` git-fast-import Shawn O. Pearce
  2007-02-06 13:50 ` git-fast-import Alex Riesen
  6 siblings, 1 reply; 52+ messages in thread
From: Jakub Narebski @ 2007-02-06  9:53 UTC (permalink / raw)
  To: git

Shawn O. Pearce wrote:

> `filemodify`
> ^^^^^^^^^^
[...]
> `filedelete`
> ^^^^^^^^^^

Shouldn't it be:

`filemodify`
^^^^^^^^^^^^

and:

`filedelete`
^^^^^^^^^^^^

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: git-fast-import
  2007-02-06  2:31 git-fast-import Shawn O. Pearce
                   ` (5 preceding siblings ...)
  2007-02-06  9:53 ` git-fast-import Jakub Narebski
@ 2007-02-06 13:50 ` Alex Riesen
  2007-02-06 17:43   ` git-fast-import Shawn O. Pearce
  6 siblings, 1 reply; 52+ messages in thread
From: Alex Riesen @ 2007-02-06 13:50 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

On 2/6/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> I'm mainly worried about breaking compilation on odd architectures.
> gfi builds, runs and has been used for production level imports
> on Mac OS X, Linux and Dragonfly BSD, using both 32 bit and 64 bit
> architectures, but some of Git's other targets (e.g. AIX) haven't
> seen any testing.

Compilation errors are the simplest to fix, just send it in.
I have to import lots of data from perforce spaghetti, so I'm very
likely to try it out.


* Re: git-fast-import
  2007-02-06  5:48   ` git-fast-import Shawn O. Pearce
@ 2007-02-06 16:35     ` Linus Torvalds
  2007-02-06 16:56       ` git-fast-import Shawn O. Pearce
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2007-02-06 16:35 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Nicolas Pitre, git



On Tue, 6 Feb 2007, Shawn O. Pearce wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> >
> > I think this is quite error prone, demonstrated by the fact that we 
> > screwed that up ourselves on a few occasions.  I think that the frontend 
> > should be relieved from this by letting it provide the time of change in 
> > a more natural format amongst all possible ones (like RFC2822 for 
> > example) and gfi should simply give it to parse_date().
> 
> This is a really good point.  It's a little bit of work to switch
> to parse_date(); I'll try to get it done tomorrow night.

Actually, I disagree. We've traditionally had _less_ bugs with the 
pure integer format than we ever had with RFC2822 format.

The original (first seven days) date format inside git objects was 
rfc2822, and it was *horrible*. Not only does it take time to parse, 
people get it constantly wrong, and it's ambiguous what summer-time means 
etc. It's basically impossible to get anything that is totally repeatable 
from it, and you have to be so lax as to effectively accept even buggy 
input. And yes, buggy input exists.

So I would strongly suggest that gfi keeps to the standard git date format 
which is easy to parse, and totally unambiguous. Yes, you can get it 
wrong, but at least then it's very clear *who* gets it wrong: it's 
whatever feeds data to gfi. If gfi accepts a "soft" format, you get into 
all these gray areas of whether you want to be strictly rfc2822 only, or 
whether you actually want to accept stuff that everybody accepts 
(including the git date functions, that try very hard to turn anything 
sensible into a date). And DST. And odd timezone names, etc etc.

Having a hard format, set in stone, and totally unambiguous, is really a 
good thing. It actually ends up resulting in fewer bugs in the end, 
because it just makes sure that everybody is on the same page.

		Linus


* Re: git-fast-import
  2007-02-06  9:28 ` git-fast-import Andy Parkins
  2007-02-06  9:40   ` git-fast-import Shawn O. Pearce
@ 2007-02-06 16:37   ` Linus Torvalds
  2007-02-06 16:44     ` git-fast-import Shawn O. Pearce
  1 sibling, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2007-02-06 16:37 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git, Shawn O. Pearce



On Tue, 6 Feb 2007, Andy Parkins wrote:
> 
> Is <tz> /really/ expressed in minutes?  500 minutes is 8 hours 20 minutes.
> 
> I know what you mean, of course; and so would anyone reading it - so I suggest 
> just dropping the ", in minutes" - as it's not true.

Agreed. It _is_ "in minutes", but it's in an oddish human-readable base-60 
format. It's certainly *not* decimal, it's more like "two decimal digits 
encode each base-60 digit in the obvious way".
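
A minimal sketch of that encoding, in Python. The helper name
`tz_to_minutes` is invented for illustration and is not part of git or
gfi; it assumes the strict sign-plus-four-digits form and rejects
anything else:

```python
def tz_to_minutes(tz: str) -> int:
    """Convert a git-style offset such as '-0500' to signed minutes from UTC.

    The four digits are HHMM: two decimal digits for hours, two for minutes.
    """
    well_formed = (
        len(tz) == 5
        and tz[0] in "+-"
        and tz[1:].isascii()
        and tz[1:].isdigit()
    )
    if not well_formed:
        raise ValueError(f"malformed offset: {tz!r}")
    hours, minutes = int(tz[1:3]), int(tz[3:5])
    if minutes >= 60:
        raise ValueError(f"minutes field out of range: {tz!r}")
    total = hours * 60 + minutes
    return -total if tz[0] == "-" else total
```

Read this way, `-0500` is 5 hours and 0 minutes behind UTC, that is
300 minutes, not 500.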

		Linus


* Re: git-fast-import
  2007-02-06 16:37   ` git-fast-import Linus Torvalds
@ 2007-02-06 16:44     ` Shawn O. Pearce
  2007-02-06 17:24       ` git-fast-import Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06 16:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andy Parkins, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 6 Feb 2007, Andy Parkins wrote:
> > 
> > Is <tz> /really/ expressed in minutes?  500 minutes is 8 hours 20 minutes.
> > 
> > I know what you mean, of course; and so would anyone reading it - so I suggest 
> > just dropping the ", in minutes" - as it's not true.
> 
> Agreed. It _is_ "in minutes", but it's in an oddish human-readable base-60 
> format. It's certainly *not* decimal, it's more like "two decimal digits 
> encode each base-60 digit in the obvious way".

What about this language?

	The time of the change is specified by `<time>` as the number of
	seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
	written in base-10 notation using US-ASCII digits.  The committer's
	timezone is specified by `<tz>` as a positive or negative offset
	from UTC.  For example EST (which is typically 5 hours behind GMT)
	would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''.

-- 
Shawn.


* Re: git-fast-import
  2007-02-06 16:35     ` git-fast-import Linus Torvalds
@ 2007-02-06 16:56       ` Shawn O. Pearce
  2007-02-06 17:20         ` git-fast-import Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06 16:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 6 Feb 2007, Shawn O. Pearce wrote:
> > Nicolas Pitre <nico@cam.org> wrote:
> > > I think this is quite error prone, demonstrated by the fact that we 
> > > screwed that up ourselves on a few occasions.  I think that the frontend 
> > > should be relieved from this by letting it provide the time of change in 
> > > a more natural format amongst all possible ones (like RFC2822 for 
> > > example) and gfi should simply give it to parse_date().
> > 
> > This is a really good point.  It's a little bit of work to switch
> > to parse_date(); I'll try to get it done tomorrow night.
> 
> Actually, I disagree. We've traditionally had _less_ bugs with the 
> pure integer format than we ever had with RFC2822 format.

Hmm.  Actually I think it depends on the source data.  :-)

If the source is only supplying RFC2822 date format and is reliable
in its formatting of such, having gfi parse that rather than
the frontend is probably more reliable.  (Git already has a well
tested date parsing routine.)  But if the source is easily able
to get a time_t then that is just as easily formatted out to gfi,
and reading that without error is child's play.

After reading your email I'm now contemplating making this a command
line flag, like `--date-format=rfc2822`, so a frontend could ask
gfi to use parse_date() and whatever error that might bring, or
stick with the pure integer format.
 
> Having a hard format, set in stone, and totally unambiguous, is really a 
> good thing. It actually ends up resulting in fewer bugs in the end, 
> because it just makes sure that everybody is on the same page.

Which is why gfi is very strict about its handling of whitespace.
It assumes *exactly* one space between input fields, or *exactly*
one LF between commands.  Anything else is assumed to be part of
the next field.  If spaces show up in the imported data, it's the
frontend that is sending stuff incorrectly.

Right now however gfi is not validating the author or committer
command arguments.  At all.  Which means that although the
documentation says the format must be such-and-such, gfi doesn't
care.  Whatever comes in on the `author` or `committer` line is
copied verbatim into the commit object.  gfi probably should at
least verify that the timestamp part of the line actually contains
digits.  :)
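
As a rough illustration of the kind of check meant here, the following
Python sketch tests whether the date part of an `author` or `committer`
line is in the raw `<time> <tz>` form. The name `is_raw_git_date` and
the regular expression are assumptions made for this example; they are
not taken from gfi's source.

```python
import re

# Decimal seconds, exactly one space, then a sign and four digits (HHMM).
RAW_DATE = re.compile(r"[0-9]+ [+-][0-9]{4}")


def is_raw_git_date(text: str) -> bool:
    """Return True only if text is exactly '<time> <tz>' with nothing extra."""
    return RAW_DATE.fullmatch(text) is not None
```

Because the match is anchored at both ends, a doubled space or any
trailing text is rejected rather than silently copied into the commit
object.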

-- 
Shawn.


* Re: git-fast-import
  2007-02-06 16:56       ` git-fast-import Shawn O. Pearce
@ 2007-02-06 17:20         ` Linus Torvalds
  2007-02-06 18:53           ` git-fast-import Nicolas Pitre
  0 siblings, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2007-02-06 17:20 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Nicolas Pitre, git



On Tue, 6 Feb 2007, Shawn O. Pearce wrote:
> 
> If the source is only supplying RFC2822 date format and is reliable
> in its formatting of such, having gfi parse that rather than
> the frontend is probably more reliable.  (Git already has a well
> tested date parsing routine.)

I'm not so worried about the git date parsing routines (which are fairly 
solid) as about the fact that absolutely *tons* of people get rfc2822 
wrong.

And we'd never even see it, because git's date-parsing routines are very 
forgiving, and allow pretty much anything (and no, I'm not talking about 
approxidate(), which really *does* allow anything, I'm talking about the 
"strict" date parser). 

They allow pretty much any half-way valid date, exactly because people 
don't do rfc2822 right anyway (and because they are also meant to work 
even if you write the date by hand, like "12:34 2005-06-07").

And *particularly* when it comes to timezones, it just guesses. The whole 
daylight savings time thing is just too hard. And if no timezone exists, 
it will just take the current one, so things may *seem* like they work, 
but then two different people importing the *same* archive in two 
different locations will actually get different results!

THAT'S A BAD THING!

It's much better to specify the date so exactly that you simply cannot get 
different results with the same input.
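
The effect is easy to reproduce outside git. The Python sketch below
does not call git's date parser; it only assumes a parser that, like
the one described above, falls back to the importer's local zone when
the input carries none. The function name `epoch_if_parsed_in` is
hypothetical.

```python
from datetime import datetime, timedelta, timezone


def epoch_if_parsed_in(text: str, utc_offset_hours: int) -> int:
    """Interpret a zone-less 'YYYY-MM-DD HH:MM' string as local time
    in a zone that is utc_offset_hours away from UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    wall_clock = datetime.strptime(text, "%Y-%m-%d %H:%M")
    return int(wall_clock.replace(tzinfo=local_zone).timestamp())


stamp = "2005-06-07 12:34"
in_utc = epoch_if_parsed_in(stamp, 0)       # importer sitting in UTC
in_pacific = epoch_if_parsed_in(stamp, -7)  # importer sitting in UTC-7
```

The two importers end up 25200 seconds (7 hours) apart for the same
input string, which is exactly the non-repeatable result being warned
about.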

Sure, you can still mess up the program that actually generates the data 
for gfi, and have bugs like that *there*, but at least they'd have to 
think a bit about it.

And the TZ problem is actually less likely if you have a strict TZ format. 
For example, when importing from CVS, the natural thing to do is to just 
always set TZ to +0000. Which gets you something reliable, and it won't 
depend on who did the import.
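
A frontend following that advice might emit its committer lines along
these lines. This is a Python sketch under stated assumptions: the
helper `committer_line` is made up for this example, and it requires
the caller to pass a timezone-aware datetime so nothing depends on the
machine doing the import.

```python
from datetime import datetime, timezone


def committer_line(name: str, email: str, when: datetime) -> str:
    """Format a committer line in the raw date format, always using +0000."""
    if when.tzinfo is None:
        # Refuse naive datetimes: they would be read as local time.
        raise ValueError("need a timezone-aware datetime")
    epoch = int(when.timestamp())
    return f"committer {name} <{email}> {epoch} +0000"


line = committer_line(
    "A U Thor",
    "at@example.com",
    datetime(2007, 2, 6, 2, 31, tzinfo=timezone.utc),
)
```

Here `line` is `committer A U Thor <at@example.com> 1170729060 +0000`
no matter where the import runs.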

But hey, especially if it's a flag, and especially if it's *documented* 
that the date parsing will depend on the current timezone etc, then maybe 
it's all ok. It's certainly convenient to be able to give the date in any 
format. It's just very easy to get bugs when you allow any random crud..

		Linus


* Re: git-fast-import
  2007-02-06  9:53 ` git-fast-import Jakub Narebski
@ 2007-02-06 17:20   ` Shawn O. Pearce
  0 siblings, 0 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06 17:20 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com> wrote:
> Shouldn't it be:
> `filemodify`
> ^^^^^^^^^^^^
> and:
> `filedelete`
> ^^^^^^^^^^^^

Yes, thanks, it's fixed.

-- 
Shawn.


* Re: git-fast-import
  2007-02-06 16:44     ` git-fast-import Shawn O. Pearce
@ 2007-02-06 17:24       ` Linus Torvalds
  2007-02-07  1:17       ` git-fast-import Horst H. von Brand
  2007-02-07  4:45       ` git-fast-import Daniel Barkalow
  2 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2007-02-06 17:24 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Andy Parkins, git



On Tue, 6 Feb 2007, Shawn O. Pearce wrote:
> 
> What about this language?
> 
> 	The time of the change is specified by `<time>` as the number of
> 	seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
> 	written in base-10 notation using US-ASCII digits.  The committer's
> 	timezone is specified by `<tz>` as a positive or negative offset
> 	from UTC.  For example EST (which is typically 5 hours behind GMT)
> 	would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''.

I doubt it would confuse anybody. Although usually we'd not say

	"in base-10 notation using US-ASCII digits"

the normal way to do that is to just say "as an ASCII decimal integer".

Sure, people could try to do "10,200,300" and claim it's "decimal 
integer", but at that point, you can just tell them they're crazy, and 
ignore them ;)

But your text certainly isn't wrong. I just think it overspecifies a bit, 
at the expense of readability.

		Linus


* Re: git-fast-import
  2007-02-06 13:50 ` git-fast-import Alex Riesen
@ 2007-02-06 17:43   ` Shawn O. Pearce
  2007-02-06 18:02     ` git-fast-import Alex Riesen
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06 17:43 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git

Alex Riesen <raa.lkml@gmail.com> wrote:
> On 2/6/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> >I'm mainly worried about breaking compilation on odd architectures.
> >gfi builds, runs and has been used for production level imports
> >on Mac OS X, Linux and Dragonfly BSD, using both 32 bit and 64 bit
> >architectures, but some of Git's other targets (e.g. AIX) haven't
> >seen any testing.
> 
> Compilation errors are the simplest to fix, just send it in.

True.

But it really is annoying when you download the latest-and-greatest
release of a package only to find out it doesn't compile on your
OS of choice, and even worse when you find out it is because of
new code that you will never use which was added in just before
the release went final!

> I have to import lots of data from perforce spaghetti, so I'm very
> likely to try it out.

I can't help you with spaghetti, but the Qt folks did make their
Perforce importer available.  Chris Lee put it in the fast-export
project on repo.or.cz.  It's a relatively short Python program.
Might help you get started.

They created annotated tags (with no message) for every p4 changeset.
I think it's just because they didn't realize you can use (abuse?) the
`reset` command in gfi to create lightweight tags instead.
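
For illustration, a frontend written in Python could produce such a
`reset` command as sketched below. The helper `lightweight_tag` is
hypothetical, and the `reset <ref>` / `from <commit-ish>` layout
follows the fast-import input format as documented in later Git
releases; the syntax of the version discussed in this thread may have
differed in detail.

```python
def lightweight_tag(tag_name: str, commit_ish: str) -> str:
    """Build a fast-import 'reset' command that points refs/tags/<tag_name>
    at commit_ish (a mark such as ':1234', or an object name)."""
    if not tag_name or " " in tag_name or "\n" in tag_name:
        raise ValueError(f"unusable tag name: {tag_name!r}")
    return f"reset refs/tags/{tag_name}\nfrom {commit_ish}\n\n"


# Tag the commit carrying mark :1234 (e.g. one p4 changeset), no tag object.
command = lightweight_tag("p4-1234", ":1234")
```

Writing `command` into the stream after the corresponding commit gives
a plain ref under refs/tags/ instead of an annotated tag with an empty
message.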


I actually implemented a "data <path" command in gfi to tell gfi
to load data from a file, for this type of case where the foreign
system has dropped the files in your working directory and you just
want Git to read them.

But there's no synchronization between gfi and the frontend (aside
from the pipe buffer throttling the frontend), so there is no way
for the frontend to know that gfi has finished a batch of files
and it's safe to ask p4 for the next revision.

So I threw it away.  It was only a 10 line patch anyway.  :)

-- 
Shawn.


* Re: git-fast-import
  2007-02-06 17:43   ` git-fast-import Shawn O. Pearce
@ 2007-02-06 18:02     ` Alex Riesen
  0 siblings, 0 replies; 52+ messages in thread
From: Alex Riesen @ 2007-02-06 18:02 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git

On 2/6/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Alex Riesen <raa.lkml@gmail.com> wrote:
> > On 2/6/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> > >I'm mainly worried about breaking compilation on odd architectures.
> > >gfi builds, runs and has been used for production level imports
> > >on Mac OS X, Linux and Dragonfly BSD, using both 32 bit and 64 bit
> > >architectures, but some of Git's other targets (e.g. AIX) haven't
> > >seen any testing.
> >
> > Compilation errors are the simplest to fix, just send it in.
>
> True.
>
> But it really is annoying when you download the latest-and-greatest
> release of a package only to find out it doesn't compile on your
> OS of choice, and even worse when you find out it is because of
> new code that you will never use which was added in just before
> the release went final!

Then send it now! :)

> > I have to import lots of data from perforce spaghetti, so I'm very
> > likely to try it out.
>
> I can't help you with spaghetti, but the Qt folks did make their
> Perforce importer available.  Chris Lee put it in the fast-export
> project on repo.or.cz.  It's a relatively short Python program.
> Might help you get started.

Yes, I saw their code. That's how I started thinking of using gfi
in my p4 imports.

> They created annotated tags (with no message) for every p4 changeset.
> I think it's just because they didn't realize you can use (abuse?) the
> `reset` command in gfi to create lightweight tags instead.

I found it's useless to do anything with p4 changes. They lack
the most important part of history: parent. The comments get
useless too, because they refer to the most recent change,
with no practical way to extract anything in between. Not much
of a problem, nobody writes anything sensible in perforce
comments anyway.


* Re: git-fast-import
  2007-02-06 17:20         ` git-fast-import Linus Torvalds
@ 2007-02-06 18:53           ` Nicolas Pitre
  2007-02-06 20:09             ` git-fast-import Shawn O. Pearce
  2007-02-07 10:58             ` git-fast-import David Woodhouse
  0 siblings, 2 replies; 52+ messages in thread
From: Nicolas Pitre @ 2007-02-06 18:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, git

On Tue, 6 Feb 2007, Linus Torvalds wrote:

> I'm not so worried about the git date parsing routines (which are fairly 
> solid) as about the fact that absolutely *tons* of people get rfc2822 
> wrong.
> 
> They allow pretty much any half-way valid date, exactly because people 
> don't do rfc2822 right anyway (and because they are also meant to work 
> even if you write the date by hand, like "12:34 2005-06-07").
> 
> Sure, you can still mess up the program that actually generates the data 
> for gfi, and have bugs like that *there*, but at least they'd have to 
> think a bit about it.

Well, exactly because GIT already has fairly solid date parsing 
routines, and the fact that we needed solid date parsing routines in the 
first place, exactly because people don't do rfc2822 right anyway, 
should be a hell of a big clue why we should parse date information for 
the gfi frontend.  Because the date is for sure most likely in a screwed 
up format already and it is counter productive to have to deal with that 
in a duplicated piece of code.  And the bare reality is that people will 
just not care to parse it right themselves.

Quoting from the gfi manual:

|A typical frontend for gfi tends to weigh in at approximately 200
|lines of Perl/Python/Ruby code.  Most developers have been able to
|create working importers in just a couple of hours, even though it
|is their first exposure to gfi, and sometimes even to Git.  This is
|an ideal situation, given that most conversion tools are throw-away
|(use once, and never look back).

This is therefore a damn good idea if gfi can make things right out of 
crap because frontends will not get much attention after the first "hey 
it works" level.  And the GIT date format, albeit being perfectly 
unambiguous, is not in line with the statement above.

With the GIT date format a conversion _will_ be necessary in the 
frontend, while if gfi shoves it to parse_date() instead then no 
conversion is even likely to be needed by the frontend.  I'd much prefer 
if frontend writers didn't have to care (and most probably manage to 
botch it if they have to) about date conversion.  We even botched it a 
few times ourselves despite the fact that we're damn good.

And because our date parsing code is damn good (hey we're just damn good 
aren't we?) I would bet that there will be far fewer conversion errors 
if gfi used parse_date() on provided data than if the frontend tries to 
parse the date itself.  This is what we feed email submission through 
everyday anyway, so we must trust it to do a good job for imports as 
well.


Nicolas


* Re: git-fast-import
  2007-02-06 18:53           ` git-fast-import Nicolas Pitre
@ 2007-02-06 20:09             ` Shawn O. Pearce
  2007-02-06 21:03               ` git-fast-import Nicolas Pitre
  2007-02-07 10:58             ` git-fast-import David Woodhouse
  1 sibling, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06 20:09 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, git

Nicolas Pitre <nico@cam.org> wrote:
> This is therefore a damn good idea if gfi can make things right out of 
> crap because frontends will not get much attention after the first "hey 
> it works" level.  And the GIT date format, albeit being perfectly 
> unambiguous, is not in line with the statement above.

Done.  I just pushed a change to gfi which adds `--date-format=<fmt>`.
For <fmt> you have the choice of:

  raw: Standard Git format.  This is the default, as it's what
  the existing frontends by Chris Lee, Simon Hausmann, Jon Smirl,
  and Simon 'corecode' Schubert expect.

  rfc2822: Run whatever crap you give us through parse_date(),
  and cross your fingers.  If parse_date() returns < 0 we bomb
  out, but otherwise take it at its word.

  now: This is a toy, but useful if you really want now, dammit.
  We just call datestamp() and tack that in.  Note that the frontend
  must also supply the literal string `now` in the committer line
  (e.g. "committer A U Thor <at@example.com> now") to prevent us
  from bombing out.
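
To make the three choices concrete, here is a Python sketch of the
dispatch. It is not gfi's C code. The function `normalize_date` is
invented for this example, Python's `email.utils.parsedate_to_datetime`
stands in for git's parse_date() (which accepts far more), and the
`now` branch always reports +0000 as a simplification, whereas
datestamp() would use the local zone.

```python
import time
from email.utils import parsedate_to_datetime


def normalize_date(text: str, fmt: str = "raw") -> str:
    """Return a raw '<time> <tz>' string for the chosen input format."""
    if fmt == "raw":
        # Taken at its word, as in the default mode described above.
        return text
    if fmt == "rfc2822":
        parsed = parsedate_to_datetime(text)
        if parsed.tzinfo is None:
            raise ValueError(f"no usable UTC offset in {text!r}")
        offset_minutes = int(parsed.utcoffset().total_seconds()) // 60
        sign = "-" if offset_minutes < 0 else "+"
        hours, minutes = divmod(abs(offset_minutes), 60)
        return f"{int(parsed.timestamp())} {sign}{hours:02d}{minutes:02d}"
    if fmt == "now":
        if text != "now":
            raise ValueError("the frontend must send the literal string 'now'")
        return f"{int(time.time())} +0000"
    raise ValueError(f"unknown date format: {fmt!r}")
```

With this sketch, "Tue, 6 Feb 2007 02:31:00 +0000" and
"Mon, 5 Feb 2007 21:31:00 -0500" both normalize to the same
`<time>` value and differ only in the `<tz>` field.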

The last one will probably get more useful when I fix gfi so it can
safely commit against active refs without losing commits (make it
do a strict fast-forward check before updating).  In this case it
may be useful for something like git-cvsserver, as it avoids the
need for a temporary directory, index, etc.
 
-- 
Shawn.


* Re: git-fast-import
  2007-02-06 20:09             ` git-fast-import Shawn O. Pearce
@ 2007-02-06 21:03               ` Nicolas Pitre
  2007-02-06 21:15                 ` git-fast-import Shawn O. Pearce
  0 siblings, 1 reply; 52+ messages in thread
From: Nicolas Pitre @ 2007-02-06 21:03 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, git

On Tue, 6 Feb 2007, Shawn O. Pearce wrote:

> Done.  I just pushed a change to gfi which adds `--date-format=<fmt>`.
> For <fmt> you have the choice of:
> 
>   raw: Standard Git format.  This is the default, as it's what
>   the existing frontends by Chris Lee, Simon Hausmann, Jon Smirl,
>   and Simon 'corecode' Schubert expect.
> 
>   rfc2822: Run whatever crap you give us through parse_date(),
>   and cross your fingers.  If parse_date() returns < 0 we bomb
>   out, but otherwise take it at its word.

I think you should call it something else than rfc2822.  Because 
parse_date() accepts much more than just rfc2822.  What about "cooked"?


Nicolas


* Re: git-fast-import
  2007-02-06 21:03               ` git-fast-import Nicolas Pitre
@ 2007-02-06 21:15                 ` Shawn O. Pearce
  2007-02-06 21:42                   ` git-fast-import Nicolas Pitre
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-06 21:15 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, git

Nicolas Pitre <nico@cam.org> wrote:
> I think you should call it something else than rfc2822.  Because 
> parse_date() accepts much more than just rfc2822.  What about "cooked"?

It does accept a lot more than that, but straying away from rfc2822
gets into the grey areas of parse_date().  E.g. it matches crap such
as 'yyyy-mm-dd' or 'yyyy-dd-mm'.  But that is completely ambiguous!
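
The ambiguity can be shown with a few lines of Python. This uses the
standard library's strptime, not git's parse_date(), and the helper
name `parse_as` is made up for the example:

```python
from datetime import date, datetime


def parse_as(text: str, layout: str) -> date:
    """Parse text with an explicit strptime layout and return the date."""
    return datetime.strptime(text, layout).date()


month_first = parse_as("2007-02-06", "%Y-%m-%d")  # February 6
day_first = parse_as("2007-02-06", "%Y-%d-%m")    # June 2
```

Both layouts accept the string without complaint, so only the frontend
can know which of the two dates was meant.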

I don't really want to advertise that it is accepting non-RFC 2822
input here.  I was thinking of doing an `iso` (yyyy-mm-dd hh:mm:ss)
format, which may just defer into parse_date(), but again encourage
the frontend to *only* feed that ISO style format.

-- 
Shawn.


* Re: git-fast-import
  2007-02-06 21:15                 ` git-fast-import Shawn O. Pearce
@ 2007-02-06 21:42                   ` Nicolas Pitre
  0 siblings, 0 replies; 52+ messages in thread
From: Nicolas Pitre @ 2007-02-06 21:42 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, git

On Tue, 6 Feb 2007, Shawn O. Pearce wrote:

> Nicolas Pitre <nico@cam.org> wrote:
> > I think you should call it something else than rfc2822.  Because 
> > parse_date() accepts much more than just rfc2822.  What about "cooked"?
> 
> It does accept a lot more than that, but straying away from rfc2822
> gets into the grey areas of parse_date().

OK that makes sense.


Nicolas


* Re: git-fast-import
  2007-02-06 16:44     ` git-fast-import Shawn O. Pearce
  2007-02-06 17:24       ` git-fast-import Linus Torvalds
@ 2007-02-07  1:17       ` Horst H. von Brand
  2007-02-07  2:50         ` git-fast-import Linus Torvalds
  2007-02-07  5:46         ` git-fast-import Shawn O. Pearce
  2007-02-07  4:45       ` git-fast-import Daniel Barkalow
  2 siblings, 2 replies; 52+ messages in thread
From: Horst H. von Brand @ 2007-02-07  1:17 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, Andy Parkins, git

Shawn O. Pearce <spearce@spearce.org> wrote:

[...]

> What about this language?
> 
> 	The time of the change is specified by `<time>` as the number of
> 	seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
> 	written in base-10 notation using US-ASCII digits.  The committer's
> 	timezone is specified by `<tz>` as a positive or negative offset
> 	from UTC.  For example EST (which is typically 5 hours behind GMT)
> 	would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''.

That is /not/ a timezone! Maybe an offset from UTC.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513


* Re: git-fast-import
  2007-02-07  1:17       ` git-fast-import Horst H. von Brand
@ 2007-02-07  2:50         ` Linus Torvalds
  2007-02-07  5:53           ` git-fast-import Shawn O. Pearce
  2007-02-08 21:34           ` git-fast-import Johannes Schindelin
  2007-02-07  5:46         ` git-fast-import Shawn O. Pearce
  1 sibling, 2 replies; 52+ messages in thread
From: Linus Torvalds @ 2007-02-07  2:50 UTC (permalink / raw)
  To: Horst H. von Brand; +Cc: Shawn O. Pearce, Andy Parkins, git



On Tue, 6 Feb 2007, Horst H. von Brand wrote:
>
> Shawn O. Pearce <spearce@spearce.org> wrote:
> 
> [...]
> 
> > What about this language?
> > 
> > 	The time of the change is specified by `<time>` as the number of
> > 	seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
> > 	written in base-10 notation using US-ASCII digits.  The committer's
> > 	timezone is specified by `<tz>` as a positive or negative offset
> > 	from UTC.  For example EST (which is typically 5 hours behind GMT)
> > 	would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''.
> 
> That is /not/ a timezone! Maybe an offset from UTC.

Btw, one thing that might be a good idea to document very clearly:

 - in the native git format, the offset from UTC has *nothing* to do with 
   the actual time itself. The time in native git is always in UTC, and 
   the offset from UTC does not change "time" - it's purely there to tell 
   in which timezone the event happened.

   So 12345678 +0000 and 12345678 -0700 are *exactly*the*same*date*, 
   except event one happened in UTC, and the other happened in UTC-7.

 - in rfc2822 format, the offset from UTC actually *changes* the date. The 
   date "Oct 12, 2006 20:00:00" will be two _different_ times when you say 
   it is in PST or in UTC.

And yes, for all I know we might get this wrong inside git too. It's easy 
to get confused, because they really do mean different things.

For an example of this, do

	make test-date

in git (which parses the argument using the "exact date" and "approxidate" 
versions respectively, and the exact date parsing will give the internal 
git representation on the first line in the middle column), and then:

	./test-date "1234567890 -0800"
	./test-date "1234567890 +0000"

and then try

	./test-date "Fri Feb 13 15:31:30 2009 PST"
	./test-date "Fri Feb 13 15:31:30 2009 UTC"

and notice how the first two (numeric) dates that differ in UTC offset 
will still return the exact same seconds since the epoch:

	1234567890 -0800
	1234567890 +0000

but the second example (with a rfc2822-like date), will show how the 
seconds-since-epoch changes, and gives:

	1234567890 -0800
	1234539090 +0000

respectively for those two dates.

Logical? It actually is, but you have to understand how git represents 
date to see the logic. To git, the "timezone" is really totally 
irrelevant. It doesn't really affect the "date" at all. At most, it 
affects how you _print_ the date, and you can tell what timezone the 
computer was set to when the commit was made.
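The distinction can be sketched in a few lines of Python. This is only an illustration and not git's own parser: the helper names `raw_seconds` and `rfc2822_seconds` are made up here, and the RFC 2822 strings are written in strict form (numeric offsets) so that the standard library's `email.utils` can read them.

```python
from email.utils import parsedate_to_datetime


def raw_seconds(when):
    """Return the seconds field of a raw '<time> SP <offset>' value.

    The offset is split off and dropped: it does not change the instant.
    """
    seconds, _offset = when.split(" ")
    return int(seconds)


def rfc2822_seconds(when):
    """Return seconds since the epoch for an RFC 2822 date string.

    Here the offset is part of the meaning of the date, so it shifts
    the resulting number of seconds.
    """
    return int(parsedate_to_datetime(when).timestamp())


# Raw values: same seconds, whatever the offset says.
same = raw_seconds("1234567890 -0800") == raw_seconds("1234567890 +0000")

# RFC 2822 values: same wall-clock reading, different instants.
in_pst = rfc2822_seconds("Fri, 13 Feb 2009 15:31:30 -0800")
in_utc = rfc2822_seconds("Fri, 13 Feb 2009 15:31:30 +0000")
```

With these inputs `same` is true, while `in_pst` and `in_utc` differ by the eight hours between the two offsets, matching the test-date output quoted above.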

And yes, I would not be at all surprised if we had some bug here where we 
got it wrong occasionally.

		Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-06 16:44     ` git-fast-import Shawn O. Pearce
  2007-02-06 17:24       ` git-fast-import Linus Torvalds
  2007-02-07  1:17       ` git-fast-import Horst H. von Brand
@ 2007-02-07  4:45       ` Daniel Barkalow
  2 siblings, 0 replies; 52+ messages in thread
From: Daniel Barkalow @ 2007-02-07  4:45 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, Andy Parkins, git

On Tue, 6 Feb 2007, Shawn O. Pearce wrote:

> What about this language?
> 
> 	The time of the change is specified by `<time>` as the number of
> 	seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
> 	written in base-10 notation using US-ASCII digits.  The committer's
> 	timezone is specified by `<tz>` as a positive or negative offset
> 	from UTC.  For example EST (which is typically 5 hours behind GMT)
> 	would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''.

EST is always 5 hours behind GMT. During the summer, EST is still 5 hours 
behind GMT, but the clocks which use ET are set to EDT (-0400) instead. 

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-06  6:18   ` git-fast-import Shawn O. Pearce
@ 2007-02-07  4:55     ` Daniel Barkalow
  2007-02-07  9:13       ` git-fast-import Karl Hasselström
                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Daniel Barkalow @ 2007-02-07  4:55 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Aneesh Kumar K.V, git

On Tue, 6 Feb 2007, Shawn O. Pearce wrote:

> "Aneesh Kumar K.V" <aneesh.kumar@gmail.com> wrote:
> > >SYNOPSIS
> > >--------
> > >frontend | 'git-fast-import' [options]
> > >
> > 
> > Do we have example frontend  that can be added along with gfi ?
> 
> Not yet.  Some frontends are available here on repo.or.cz:
> 
>   gitweb: http://repo.or.cz/w/fast-export.git
>   clone:  git://repo.or.cz/fast-export.git
> 
> But both lack branch support, for example, so they probably aren't
> nearly as complete as the existing non-gfi based importers.

It might be nice to have a git-fast-export, which could actually be 
potentially useful for generating a repository with systematic differences 
from the original. (E.g., to make a repository of git's Documentation 
directory, with just the commits that affect it)

That might also be a big help to projects that find they should have been 
using more, fewer, or different repositories through their history.

Also, I'd guess that it would be pretty straightforward and easy to 
understand, plus easy to verify for correctness on large examples.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07  1:17       ` git-fast-import Horst H. von Brand
  2007-02-07  2:50         ` git-fast-import Linus Torvalds
@ 2007-02-07  5:46         ` Shawn O. Pearce
  1 sibling, 0 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-07  5:46 UTC (permalink / raw)
  To: Horst H. von Brand; +Cc: Linus Torvalds, Andy Parkins, git

"Horst H. von Brand" <vonbrand@inf.utfsm.cl> wrote:
> Shawn O. Pearce <spearce@spearce.org> wrote:
> > 	written in base-10 notation using US-ASCII digits.  The committer's
> > 	timezone is specified by `<tz>` as a positive or negative offset
> > 	from UTC.  For example EST (which is typically 5 hours behind GMT)
> > 	would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''.
> 
> That is /not/ a timezone! Maybe an offset from UTC.

Indeed.  Thank you for the correction.  I'll push out fixed docs
shortly.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07  2:50         ` git-fast-import Linus Torvalds
@ 2007-02-07  5:53           ` Shawn O. Pearce
  2007-02-07  9:21             ` git-fast-import Karl Hasselström
  2007-02-07 22:18             ` git-fast-import Horst H. von Brand
  2007-02-08 21:34           ` git-fast-import Johannes Schindelin
  1 sibling, 2 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-07  5:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Horst H. von Brand, Andy Parkins, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Btw, one thing that might be a good idea to document very clearly:
> 
>  - in the native git format, the offset from UTC has *nothing* to do with 
>    the actual time itself. The time in native git is always in UTC, and 
>    the offset from UTC does not change "time" - it's purely there to tell 
>    in which timezone the event happened.
> 
>    So 12345678 +0000 and 12345678 -0700 are *exactly*the*same*date*, 
>    except event one happened in UTC, and the other happened in UTC-7.
> 
>  - in rfc2822 format, the offset from UTC actually *changes* the date. The 
>    date "Oct 12, 2006 20:00:00" will be two _different_ times when you say 
>    it is in PST or in UTC.

Here is the current language relating to date parsing in gfi:

Date Formats
~~~~~~~~~~~~
The following date formats are supported.  A frontend should select
the format it will use for this import by passing the format name
in the `--date-format=<fmt>` command line option.

`raw`::
	This is the Git native format and is `<time> SP <offutc>`.
	It is also gfi's default format, if `--date-format` was
	not specified.
+
The time of the event is specified by `<time>` as the number of
seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
written as an ASCII decimal integer.
+
The local offset is specified by `<offutc>` as a positive or negative
offset from UTC.  For example EST (which is 5 hours behind UTC)
would be expressed in `<offutc>` by ``-0500'' while UTC is ``+0000''.
The local offset does not affect `<time>`; it is used only as an
advisement to help formatting routines display the timestamp.
+
If the local offset is not available in the source material, use
``+0000'', or the most common local offset.  For example many
organizations have a CVS repository which has only ever been accessed
by users who are located in the same location and timezone.  In this
case the offset from UTC can be easily assumed.
+
Unlike the `rfc2822` format, this format is very strict.  Any
variation in formatting will cause gfi to reject the value.

`rfc2822`::
	This is the standard email format as described by RFC 2822.
+
An example value is ``Tue Feb 6 11:22:18 2007 -0500''.  The Git
parser is accurate, but a little on the lenient side.  Its the
same parser used by gitlink:git-am[1] when applying patches
received from email.
+
Some malformed strings may be accepted as valid dates.  In some of
these cases Git will still be able to obtain the correct date from
the malformed string.  There are also some types of malformed
strings which Git will parse wrong, and yet consider valid.
Seriously malformed strings will be rejected.
+
Unlike the `raw` format above, the timezone/UTC offset information
contained in an RFC 2822 date string is used to adjust the date
value to UTC prior to storage.  Therefore it is important that
this information be as accurate as possible.
+
If the source material is formatted in RFC 2822 style dates,
the frontend should let gfi handle the parsing and conversion
(rather than attempting to do it itself) as the Git parser has
been well tested in the wild.
+
Frontends should prefer the `raw` format if the source material
is already in UNIX-epoch format, or is easily convertible to
that format, as there is no ambiguity in parsing.

`now`::
	Always use the current time and timezone.  The literal
	`now` must always be supplied for `<when>`.
+
This is a toy format.  The current time and timezone of this system
is always copied into the identity string at the time it is being
created by gfi.  There is no way to specify a different time or
timezone.
+
This particular format is supplied as it's short to implement and
may be useful to a process that wants to create a new commit
right now, without needing to use a working directory or
gitlink:git-update-index[1].
+
If separate `author` and `committer` commands are used in a `commit`
the timestamps may not match, as the system clock will be polled
twice (once for each command).  The only way to ensure that both
author and committer identity information has the same timestamp
is to omit `author` (thus copying from `committer`) or to use a
date format other than `now`.
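For a frontend that already holds timezone-aware timestamps, producing the `raw` form described above could look roughly like the following Python sketch. The function names `raw_when` and `looks_like_raw` are hypothetical, and the regular expression is only one reading of "very strict"; it is not taken from the gfi source.

```python
import re
from datetime import datetime, timedelta, timezone

# One decimal integer, one space, a sign and exactly four digits.
# This pattern is an assumption about what "strict" means here.
RAW_PATTERN = re.compile(r"[0-9]+ [+-][0-9]{4}")


def raw_when(moment):
    """Format a timezone-aware datetime as '<time> SP <offutc>'."""
    if moment.tzinfo is None:
        raise ValueError("need a timezone-aware datetime")
    # timestamp() is already seconds since the epoch in UTC;
    # %z renders the local offset as +hhmm or -hhmm.
    return f"{int(moment.timestamp())} {moment.strftime('%z')}"


def looks_like_raw(value):
    """Check a string against the assumed strict raw layout."""
    return RAW_PATTERN.fullmatch(value) is not None


# The example date from the rfc2822 section, five hours behind UTC.
est = timezone(timedelta(hours=-5))
example = raw_when(datetime(2007, 2, 6, 11, 22, 18, tzinfo=est))
```

Doing the conversion in the frontend keeps any guessing about offsets next to the source data, and the strict check catches values such as a missing sign before they reach the importer.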

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07  4:55     ` git-fast-import Daniel Barkalow
@ 2007-02-07  9:13       ` Karl Hasselström
  2007-02-07 11:17         ` git-fast-import Johannes Schindelin
  2007-02-07  9:29       ` git-fast-import Raimund Bauer
  2007-02-07 13:38       ` git-fast-import David Woodhouse
  2 siblings, 1 reply; 52+ messages in thread
From: Karl Hasselström @ 2007-02-07  9:13 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Shawn O. Pearce, Aneesh Kumar K.V, git

On 2007-02-06 23:55:46 -0500, Daniel Barkalow wrote:

> It might be nice to have a git-fast-export, which could actually be
> potentially useful for generating a repository with systematic
> differences from the original. (E.g., to make a repository of git's
> Documentation directory, with just the commits that affect it)

Or to solve problems like

  Gaaah! This file we've had in the repository for the last 17 months
  has copyright problems and we can't distribute it!

or

  Wouldn't it be nice to permanently include all that old Linux
  history that's currently grafted onto the "real" history?

In other words, general history rewriting, but fast.

(Disclaimer: I've never tried to use the history rewrite tool that
Cogito has, so I don't know its limitations, or how fast it is.)

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07  5:53           ` git-fast-import Shawn O. Pearce
@ 2007-02-07  9:21             ` Karl Hasselström
  2007-02-07 22:18             ` git-fast-import Horst H. von Brand
  1 sibling, 0 replies; 52+ messages in thread
From: Karl Hasselström @ 2007-02-07  9:21 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, Horst H. von Brand, Andy Parkins, git

On 2007-02-07 00:53:52 -0500, Shawn O. Pearce wrote:

> Its the same parser used by gitlink:git-am[1] when applying patches
> received from email.

Should be "It's" or "It is".

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle

^ permalink raw reply	[flat|nested] 52+ messages in thread

* RE: git-fast-import
  2007-02-07  4:55     ` git-fast-import Daniel Barkalow
  2007-02-07  9:13       ` git-fast-import Karl Hasselström
@ 2007-02-07  9:29       ` Raimund Bauer
  2007-02-07 13:38       ` git-fast-import David Woodhouse
  2 siblings, 0 replies; 52+ messages in thread
From: Raimund Bauer @ 2007-02-07  9:29 UTC (permalink / raw)
  To: 'Daniel Barkalow', 'Shawn O. Pearce'
  Cc: 'Aneesh Kumar K.V', git

> It might be nice to have a git-fast-export, which could actually be 
> potentially useful for generating a repository with systematic differences 
> from the original. (E.g., to make a repository of git's Documentation
> directory, with just the commits that affect it)

Search the list-archives for "git-split", that may be what you're looking
for.

-- 
best regards

  Ray

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-06 18:53           ` git-fast-import Nicolas Pitre
  2007-02-06 20:09             ` git-fast-import Shawn O. Pearce
@ 2007-02-07 10:58             ` David Woodhouse
  1 sibling, 0 replies; 52+ messages in thread
From: David Woodhouse @ 2007-02-07 10:58 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Shawn O. Pearce, git

On Tue, 2007-02-06 at 13:53 -0500, Nicolas Pitre wrote:
> On Tue, 6 Feb 2007, Linus Torvalds wrote:
> 
> > I'm not so worried about the git date parsing routines (which are fairly 
> > solid) as about the fact that absolutely *tons* of people get rfc2822 
> > wrong.
> > 
> > They allow pretty much any half-way valid date, exactly because people 
> > don't do rfc2822 right anyway (and because they are also meant to work 
> > even if you write the date by hand, like "12:34 2005-06-07").
> > 
> > Sure, you can still mess up the program that actually generates the data 
> > for gfi, and have bugs like that *there*, but at least they'd have to 
> > think a bit about it.
> 
> Well, exactly because GIT already has fairly solid date parsing 
> routines, and the fact that we needed solid date parsing routines in the 
> first place, exactly because people don't do rfc2822 right anyway, 
> should be a hell of a big clue why we should parse date information for 
> the gfi frontend.  Because the date is for sure most likely in a screwed 
> up format already and it is counter productive to have to deal with that 
> in a duplicated piece of code.  And the bare reality is that people will 
> just not care to parse it right themselves. 

Nevertheless, they _should_. The principle is simple -- wherever there
is ambiguity, you should seek to resolve that as _close_ to the point of
origin as possible. Your 'best guess' gets worse and worse the further
you go from the source of the data.

If you're exporting from a legacy repository in one part of the world,
then transferring the raw data to a machine elsewhere to be imported
into git, you _really_ want to be making your guesses about timezones
and character sets in the _export_ stage; not the subsequent import.

So there's a lot to be said for nailing down gfi's intermediate format
and removing _all_ the ambiguity from it -- using git format dates
(which I did that way precisely for the lack of ambiguity), and using
UTF-8 (or some other _specified_ but not assumed character set).

-- 
dwmw2

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07  9:13       ` git-fast-import Karl Hasselström
@ 2007-02-07 11:17         ` Johannes Schindelin
  2007-02-07 22:55           ` git-fast-import Shawn O. Pearce
  0 siblings, 1 reply; 52+ messages in thread
From: Johannes Schindelin @ 2007-02-07 11:17 UTC (permalink / raw)
  To: Karl Hasselström
  Cc: Daniel Barkalow, Shawn O. Pearce, Aneesh Kumar K.V, git

Hi,

On Wed, 7 Feb 2007, Karl Hasselström wrote:

> On 2007-02-06 23:55:46 -0500, Daniel Barkalow wrote:
> 
> > It might be nice to have a git-fast-export, which could actually be
> > potentially useful for generating a repository with systematic
> > differences from the original. (E.g., to make a repository of git's
> > Documentation directory, with just the commits that affect it)
> 
> Or to solve problems like
> 
>   Gaaah! This file we've had in the repository for the last 17 months
>   has copyright problems and we can't distribute it!
> 
> or
> 
>   Wouldn't it be nice to permanently include all that old Linux
>   history that's currently grafted onto the "real" history?
> 
> In other words, general history rewriting, but fast.

For this, it would be better to use a different approach: fast-import 
still hashes all the objects, which would not be necessary when rewriting. 
I guess that is what cogito's tool is doing.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07  4:55     ` git-fast-import Daniel Barkalow
  2007-02-07  9:13       ` git-fast-import Karl Hasselström
  2007-02-07  9:29       ` git-fast-import Raimund Bauer
@ 2007-02-07 13:38       ` David Woodhouse
  2 siblings, 0 replies; 52+ messages in thread
From: David Woodhouse @ 2007-02-07 13:38 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Shawn O. Pearce, Aneesh Kumar K.V, git

On Tue, 2007-02-06 at 23:55 -0500, Daniel Barkalow wrote:
> It might be nice to have a git-fast-export, which could actually be 
> potentially useful for generating a repository with systematic differences 
> from the original. (E.g., to make a repository of git's Documentation 
> directory, with just the commits that affect it) 

That kind of thing isn't hard to do. See the scripts which create the
'JFFS2 for eCos' git tree or the 'exported kernel headers' git tree,
directly from Linus' git tree.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07  5:53           ` git-fast-import Shawn O. Pearce
  2007-02-07  9:21             ` git-fast-import Karl Hasselström
@ 2007-02-07 22:18             ` Horst H. von Brand
  2007-02-07 22:31               ` git-fast-import Jakub Narebski
  2007-02-07 22:39               ` git-fast-import Linus Torvalds
  1 sibling, 2 replies; 52+ messages in thread
From: Horst H. von Brand @ 2007-02-07 22:18 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Linus Torvalds, Horst H. von Brand, Andy Parkins, git

Shawn O. Pearce <spearce@spearce.org> wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Btw, one thing that might be a good idea to document very clearly:
> > 
> >  - in the native git format, the offset from UTC has *nothing* to do with 
> >    the actual time itself. The time in native git is always in UTC, and 
> >    the offset from UTC does not change "time" - it's purely there to tell 
> >    in which timezone the event happened.
> > 
> >    So 12345678 +0000 and 12345678 -0700 are *exactly*the*same*date*, 
> >    except event one happened in UTC, and the other happened in UTC-7.
> > 
> >  - in rfc2822 format, the offset from UTC actually *changes* the date. The 
> >    date "Oct 12, 2006 20:00:00" will be two _different_ times when you say 
> >    it is in PST or in UTC.
> 
> Here is the current language relating to date parsing in gfi:
> 
> Date Formats
> ~~~~~~~~~~~~

[...]

> +
> If the local offset is not available in the source material, use
> ``+0000'', or the most common local offset.  For example many
> organizations have a CVS repository which has only ever been accessed
> by users who are located in the same location and timezone.  In this
> case the offset from UTC can be easily assumed.

No, it can't. There are summer/winter times, etc.

> +
> Unlike the `rfc2822` format, this format is very strict.  Any
> variation in formatting will cause gfi to reject the value.
> 
> `rfc2822`::
> 	This is the standard email format as described by RFC 2822.
> +
> An example value is ``Tue Feb 6 11:22:18 2007 -0500''.  The Git
> parser is accurate, but a little on the lenient side.  Its the
> same parser used by gitlink:git-am[1] when applying patches
> received from email.
> +
> Some malformed strings may be accepted as valid dates.  In some of
> these cases Git will still be able to obtain the correct date from
> the malformed string.  There are also some types of malformed
> strings which Git will parse wrong, and yet consider valid.
> Seriously malformed strings will be rejected.
> +
> Unlike the `raw` format above, the timezone/UTC offset information
> contained in an RFC 2822 date string is used to adjust the date
> value to UTC prior to storage.  Therefore it is important that
> this information be as accurate as possible.

Say what? If I use the "raw" format with UTC offset, the offset is just
ignored then?

> +
> If the source material is formatted in RFC 2822 style dates,

"uses RFC 2822 style dates" would be better

> the frontend should let gfi handle the parsing and conversion
> (rather than attempting to do it itself) as the Git parser has
> been well tested in the wild.
> +
> Frontends should prefer the `raw` format if the source material
> is already in UNIX-epoch format, or is easily convertible to

"already uses Unix-epoch format, can be coaxed to give dates in that
format, or its format is easily convertible to it" sounds better to me

> that format, as there is no ambiguity in parsing.
> 
> `now`::
> 	Always use the current time and timezone.  The literal
> 	`now` must always be supplied for `<when>`.

[...]

> +
> If separate `author` and `committer` commands are used in a `commit`
> the timestamps may not match, as the system clock will be polled
> twice (once for each command).

Better fix that. It can't be that costly to call gettimeofday(2) once and
squirrel the result away for later use.
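A minimal sketch of that idea in Python, assuming a hypothetical `FrozenNow` helper that a single commit would share between its `author` and `committer` lines (gfi itself is written in C and is not structured this way as far as this thread shows):

```python
import time


class FrozenNow:
    """Read the clock once, then hand back the same value every time."""

    def __init__(self, clock=time.time):
        self._clock = clock      # injectable so the behaviour can be tested
        self._seconds = None     # filled in on first use

    def seconds(self):
        if self._seconds is None:
            self._seconds = int(self._clock())
        return self._seconds


# A fake clock that advances on every call makes the effect visible.
ticks = iter([100.0, 101.0, 102.0])
now = FrozenNow(clock=lambda: next(ticks))
author_time = now.seconds()
committer_time = now.seconds()
```

Both `author_time` and `committer_time` come from the first clock reading, so they cannot drift apart even if the second call happens a second later.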

>                                 The only way to ensure that both
> author and committer identity information has the same timestamp
> is to omit `author` (thus copying from `committer`) or to use a
> date format other than `now`.

See?
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07 22:18             ` git-fast-import Horst H. von Brand
@ 2007-02-07 22:31               ` Jakub Narebski
  2007-02-07 22:39               ` git-fast-import Linus Torvalds
  1 sibling, 0 replies; 52+ messages in thread
From: Jakub Narebski @ 2007-02-07 22:31 UTC (permalink / raw)
  To: git

Horst H. von Brand wrote:
> Shawn O. Pearce <spearce@spearce.org> wrote:

>> Unlike the `raw` format above, the timezone/UTC offset information
>> contained in an RFC 2822 date string is used to adjust the date
>> value to UTC prior to storage.  Therefore it is important that
>> this information be as accurate as possible.
> 
> Say what? If I use the "raw" format with UTC offset, the offset is just
> ignored then?

It is saved, and used only when _displaying_ human readable date.
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07 22:18             ` git-fast-import Horst H. von Brand
  2007-02-07 22:31               ` git-fast-import Jakub Narebski
@ 2007-02-07 22:39               ` Linus Torvalds
  1 sibling, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2007-02-07 22:39 UTC (permalink / raw)
  To: Horst H. von Brand; +Cc: Shawn O. Pearce, Andy Parkins, git



On Wed, 7 Feb 2007, Horst H. von Brand wrote:
> 
> Say what? If I use the "raw" format with UTC offset, the offset is just
> ignored then?

The offset that git maintains is basically always ignored by git except 
for pure printout purposes.

For example, when you traverse commits, git normally picks the next 
reachable commit to show by using the date. The UTC offset has no effect 
on anything.

In fact, when we parse a commit, we don't even *parse* the timezone info. 
Look in commit.c: parse_commit_date. The timezone really doesn't even 
exist as far as any "real" git operation is concerned. It's just saved 
away, and it's _shown_ in "git log", but it has no real meaning apart from 
that.

So git very much only works on UTC time internally, and the only thing 
that actually matters in a string like "1234567890 -0700" is the first 
part. The "-0700" is _literally_ just a comment that is only ever even 
parsed by "pretty_print_commit()".
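As a rough Python illustration of "ordering uses the seconds, printing uses the offset" (the helper names `sort_key` and `show` are invented for this sketch and do not mirror git's C code):

```python
from datetime import datetime, timedelta, timezone


def sort_key(when):
    """Ordering looks at the seconds field only; the offset is never read."""
    return int(when.split(" ")[0])


def show(when):
    """Render a raw '<time> <offset>' value the way a log printout might.

    Only here does the offset matter: it picks the wall clock to print.
    """
    seconds, offset = when.split(" ")
    sign = -1 if offset[0] == "-" else 1
    delta = sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    local = datetime.fromtimestamp(int(seconds), timezone(delta))
    return local.strftime("%a %b %d %H:%M:%S %Y ") + offset


dates = ["1234567891 +0000", "1234567890 -0700", "1234567889 +0900"]
ordered = sorted(dates, key=sort_key)
```

Sorting puts the entries in order of their first field regardless of offset, while `show` prints the same instant with a different wall-clock time for each offset.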

Btw, CVS doesn't have any TZ info at all, so CVS also internally always 
saves in UTC. It then tends to print out logs in whatever timezone you 
happen to be in at the time of printout, afaik. 

		Linus

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07 11:17         ` git-fast-import Johannes Schindelin
@ 2007-02-07 22:55           ` Shawn O. Pearce
  2007-02-07 23:55             ` git-fast-import Johannes Schindelin
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-07 22:55 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Karl Hasselström, Daniel Barkalow, Aneesh Kumar K.V, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Wed, 7 Feb 2007, Karl Hasselström wrote:
> > In other words, general history rewriting, but fast.
> 
> For this, it would be better to use a different approach: fast-import 
> still hashes all the objects, which would not be necessary when rewriting. 
> I guess that is what cogito's tool is doing.

gfi doesn't require that it rehash blob objects.

If the blobs in question are already available in the repository
gfi is running against (say, from the old branch history) you
can just feed those blob SHA-1s at gfi in its 'M' commands when
making commits.  Yes gfi will need to recompute the tree hashes
from scratch, but those are certainly smaller and faster to create
than blobs.

So you probably could make a faster history rewriter by taking
the output of say `git log --pretty=raw --raw -z`, filter that and
reverse it, and stream it into gfi.  It probably would kick Cogito's
cg-admin-rewritefilter thing in the teeth, as you are forking just
one gfi process rather than a thousand git-commit-tree processes.

And if you are doing more complex pathname translations than just
picking out a subtree, it also completely avoids needing to read and
write index files via update-index, or tree objects by write-tree.
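A filter along those lines might emit one `commit` command per rewritten commit, roughly as in this Python sketch. The function name `commit_command` and its parameters are hypothetical, the stream layout follows the fast-import manual page as understood here, and the blob ID shown is a placeholder rather than a real object.

```python
def commit_command(ref, committer, when, message, files, parent=None):
    """Build one fast-import 'commit' command as bytes.

    files is a list of (mode, blob_sha1, path) tuples naming blobs that
    already exist in the target repository, so no blob data is resent.
    """
    body = message.encode("utf-8")
    out = [
        f"commit {ref}\n".encode("utf-8"),
        f"committer {committer} {when}\n".encode("utf-8"),
        # 'data' takes an exact byte count, then the bytes themselves.
        f"data {len(body)}\n".encode("utf-8"),
        body,
        b"\n",
    ]
    if parent is not None:
        out.append(f"from {parent}\n".encode("utf-8"))
    for mode, blob_sha1, path in files:
        # Reference the existing blob by ID instead of re-importing it.
        out.append(f"M {mode} {blob_sha1} {path}\n".encode("utf-8"))
    out.append(b"\n")
    return b"".join(out)


PLACEHOLDER_BLOB = "0123456789abcdef0123456789abcdef01234567"  # not a real object
stream = commit_command(
    ref="refs/heads/rewritten",
    committer="A U Thor <author@example.com>",
    when="1234567890 -0800",
    message="Documentation only\n",
    files=[("100644", PLACEHOLDER_BLOB, "Documentation/git.txt")],
)
```

The resulting bytes would be written to the importer's standard input, one command after another, from a single long-running process.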

-- 
Shawn.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07 22:55           ` git-fast-import Shawn O. Pearce
@ 2007-02-07 23:55             ` Johannes Schindelin
  2007-02-08  0:12               ` git-fast-import Shawn O. Pearce
  2007-02-08 16:56               ` git-fast-import Linus Torvalds
  0 siblings, 2 replies; 52+ messages in thread
From: Johannes Schindelin @ 2007-02-07 23:55 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Karl Hasselström, Daniel Barkalow, Aneesh Kumar K.V, git

Hi,

On Wed, 7 Feb 2007, Shawn O. Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > On Wed, 7 Feb 2007, Karl Hasselström wrote:
> > > In other words, general history rewriting, but fast.
> > 
> > For this, it would be better to use a different approach: fast-import 
> > still hashes all the objects, which would not be necessary when rewriting. 
> > I guess that is what cogito's tool is doing.
> 
> gfi doesn't require that it rehash blob objects.
> 
> If the blobs in question are already available in the repository
> gfi is running against (say, from the old branch history) you
> can just feed those blob SHA-1s at gfi in its 'M' commands when
> making commits.

Ah! I overlooked that feature. Certainly, this makes gfi (could we please 
call it "fast-import", please?) very useful for history rewriting 
purposes.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07 23:55             ` git-fast-import Johannes Schindelin
@ 2007-02-08  0:12               ` Shawn O. Pearce
  2007-02-08 16:56               ` git-fast-import Linus Torvalds
  1 sibling, 0 replies; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-08  0:12 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Karl Hasselström, Daniel Barkalow, Aneesh Kumar K.V, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Ah! I overlooked that feature. Certainly, this makes gfi (could we please 
> call it "fast-import", please?) very useful for history rewriting 
> purposes.

Heh.  I was actually sort of thinking of renaming it git-gfi.  :)

git-fast-import is just too long to write.  And for some reason I
have been writing it a lot lately.  #git, email, git-fast-import's
manual page (which is now also the largest manual page in all
of Git!).

But of course the better name is git-fast-import.  Stealing a
three-letter non-hyphen-containing name for a tool the user never
is meant to run by hand is just evil.


I haven't even tried to use fast-import for general history
rewriting, let alone benchmarked it against something like git-split
or Cogito's rewriting tool, but I'd be willing to bet that fast-import
is faster.  The internal ``cache'' that it uses for the tree
construction is lightweight enough that gfi can probably recreate
only the modified trees, compress and hash them, and output what
it needs to, in the time it takes to fork+exec git-commit-tree.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: git-fast-import
  2007-02-07 23:55             ` git-fast-import Johannes Schindelin
  2007-02-08  0:12               ` git-fast-import Shawn O. Pearce
@ 2007-02-08 16:56               ` Linus Torvalds
  2007-02-08 19:10                 ` git-fast-import Shawn O. Pearce
  1 sibling, 1 reply; 52+ messages in thread
From: Linus Torvalds @ 2007-02-08 16:56 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Shawn O. Pearce, Karl Hasselström, Daniel Barkalow,
	Aneesh Kumar K.V, git



On Thu, 8 Feb 2007, Johannes Schindelin wrote:
> 
> Ah! I overlooked that feature. Certainly, this makes gfi (could we please 
> call it "fast-import", please?) very useful for history rewriting 
> purposes.

Yeah, I think fast-import is great. And I'd also like to echo that call to 
not call it "gfi". Maybe it's just me, and maybe it's just because I'm a 
home-owner who does things like add in-wall ethernet cables, but to me, 
gfi is about an electrical outlet.

So to me, gfi means "ground fault interrupter": the kind of outlet that 
breaks the circuit if there is current leaking to the ground pin. All your 
electrical outlets in "wet areas" (bathroom, kitchen within a certain 
distance of a sink, outside, near swimming pools etc) are supposed to be 
GFI's.

I realize that there's not a lot of chance of confusion in the git world, 
but still.

			Linus

* Re: git-fast-import
  2007-02-08 16:56               ` git-fast-import Linus Torvalds
@ 2007-02-08 19:10                 ` Shawn O. Pearce
  2007-02-09  8:49                   ` git-fast-import Karl Hasselström
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn O. Pearce @ 2007-02-08 19:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Johannes Schindelin, Karl Hasselström, Daniel Barkalow,
	Aneesh Kumar K.V, git

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Thu, 8 Feb 2007, Johannes Schindelin wrote:
> > 
> > Ah! I overlooked that feature. Certainly, this makes gfi (could we please 
> > call it "fast-import", please?) very useful for history rewriting 
> > purposes.
> 
> Yeah, I think fast-import is great. And I'd also like to echo that call to 
> not call it "gfi". Maybe it's just me, and maybe it's just because I'm a 
> home-owner who does things like add in-wall ethernet cables, but to me, 
> gfi is about an electrical outlet.

OK.  There happen to be 78 uses of `gfi` in the manpage.
I'll correct the spelling to fast-import.  :-)

-- 
Shawn.

* Re: git-fast-import
  2007-02-07  2:50         ` git-fast-import Linus Torvalds
  2007-02-07  5:53           ` git-fast-import Shawn O. Pearce
@ 2007-02-08 21:34           ` Johannes Schindelin
  1 sibling, 0 replies; 52+ messages in thread
From: Johannes Schindelin @ 2007-02-08 21:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Horst H. von Brand, Shawn O. Pearce, Andy Parkins, git

Hi,

On Tue, 6 Feb 2007, Linus Torvalds wrote:

> [Talks about timestamps being in UTC, even if augmented by a timezone]
>
> And yes, for all I know we might get this wrong inside git too. It's 
> easy to get confused, because they really do mean different things.

FWIW I just grepped git for tz, and looked at the results. The place I had 
to think a bit more about was in builtin-blame.c:format_time(). Probably a 
special date format is needed to stay compatible with cvsserver, otherwise 
show_date() or even show_rfc2822_date() could be used.

The code actually adds the timezone in minutes to the timestamp, and then 
calls gmtime() to be able to format the date with strftime() (something 
similar, without strftime() is done in show_[rfc2822_]date()). The result 
is correct AFAICT, although it would be cleaner IMHO to add yet another 
function to date.c which formats the time according to cvsserver's wishes.
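
To illustrate the approach described above (shift the UTC timestamp by
the timezone offset, then format it as if it were UTC), here is a
minimal Python sketch. It is not the code from builtin-blame.c; the
function names tz_to_minutes and format_local are hypothetical, and
the timezone is assumed to be given as a signed integer in +HHMM form
(e.g. -500 for -0500).

```python
import time

def tz_to_minutes(tz):
    """Convert a signed +HHMM-style integer (e.g. -500, 530) to minutes."""
    sign = -1 if tz < 0 else 1
    tz = abs(tz)
    return sign * ((tz // 100) * 60 + tz % 100)

def format_local(timestamp, tz):
    """Format a UTC epoch timestamp as local time in the given timezone."""
    # Shift the timestamp by the offset, then treat the result as UTC.
    shifted = timestamp + tz_to_minutes(tz) * 60
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(shifted))
    return "%s %+05d" % (stamp, tz)

# 2007-02-08 19:10:24 UTC, shown in a -0500 timezone
print(format_local(1170961824, -500))  # 2007-02-08 14:10:24 -0500
```

The stored timestamp stays in UTC; only the copy handed to gmtime() is
shifted, so the timezone affects display and nothing else.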

Post 1.5.0.

Ciao,
Dscho

* Re: git-fast-import
  2007-02-08 19:10                 ` git-fast-import Shawn O. Pearce
@ 2007-02-09  8:49                   ` Karl Hasselström
  2007-02-09 15:47                     ` git-fast-import Linus Torvalds
  0 siblings, 1 reply; 52+ messages in thread
From: Karl Hasselström @ 2007-02-09  8:49 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Linus Torvalds, Johannes Schindelin, Daniel Barkalow,
	Aneesh Kumar K.V, git

On 2007-02-08 14:10:24 -0500, Shawn O. Pearce wrote:

> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Yeah, I think fast-import is great. And I'd also like to echo that
> > call to not call it "gfi". Maybe it's just me, and maybe it's just
> > because I'm a home-owner who does things like add in-wall ethernet
> > cables, but to me, gfi is about an electrical outlet.
>
> OK. There happen to be 78 uses of `gfi` in the manpage. I'll correct
> the spelling to fast-import. :-)

Didn't you listen to what Linus said? Near porcelain and plumbing is
precisely where you _need_ gfi!

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle

* Re: git-fast-import
  2007-02-09  8:49                   ` git-fast-import Karl Hasselström
@ 2007-02-09 15:47                     ` Linus Torvalds
  0 siblings, 0 replies; 52+ messages in thread
From: Linus Torvalds @ 2007-02-09 15:47 UTC (permalink / raw)
  To: Karl Hasselström
  Cc: Shawn O. Pearce, Johannes Schindelin, Daniel Barkalow,
	Aneesh Kumar K.V, git



On Fri, 9 Feb 2007, Karl Hasselström wrote:
> 
> Didn't you listen to what Linus said? Near porcelain and plumbing is
> precisely where you _need_ gfi!

Groan.

The whole git project is apparently infected with a terminal case of the 
puns. 

And they keep on getting more and more obscure.

		Linus

* Re: git-fast-import
  2006-08-06  3:40 ` git-fast-import Shawn Pearce
@ 2006-08-06  4:09   ` Jon Smirl
  0 siblings, 0 replies; 52+ messages in thread
From: Jon Smirl @ 2006-08-06  4:09 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On 8/5/06, Shawn Pearce <spearce@spearce.org> wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> > git-fast-import works great. I parsed and built my pack file in
> > 1:45hr. That's way better than 24hr. I am still IO bound but that
> > seems to be an issue with not being able to read ahead 150K small
> > files. CPU utilization averages about 50%.
>
> Excellent.  Now if only the damn RCS files were in a more suitable
> format.  :-)
>
> > I didn't bother reading the sha ids back from fast-import, instead I
> > computed them in the python code. Python has a C library function for
> > sha1. That decouples the processes from each other. They would run in
> > parallel on SMP.
>
> At least you are IO bound and not CPU bound.  But it is silly for the
> importer in Python to be computing the SHA1 IDs and for fast-import
> to also be computing them.  Would it help if fast-import allowed
> you to feed in a tag string which it dumps to an output file listing
> SHA1 and the tag?  Then you can feed that data file back into your
> tree/commit processing for revision handling.

I am IO bound; there is plenty of CPU, and I am on a 2.8GHz single processor.
The sha1 is getting stored into an internal Python structure. The
structures then get sliced and diced a thousand ways to compute the
change sets.

The real goal of this is to use the cvs2svn code for change set
detection. Look at how much work these guys have put into making it
work on the various messed up CVS repositories.
http://git.catalyst.net.nz/gitweb?p=cvs2svn.git;a=shortlog;h=a9167614a7acec27e122ccf948d1602ffe5a0c4b

cvs2svn is the only tool that read and built change sets for Moz CVS
on the first try.

> > My pack file is 980MB compared to 680MB from other attempts. I am
> > still missing entries for the trees and commits.
>
> The delta selection ain't the best.  It may be the case that prior
> attempts were combining files to get better delta chains vs. staying

My suspicion is that prior attempts weren't capturing all of the
revisions. I know cvsps (the 680MB repo) was throwing away branches
that it didn't understand. I don't think anyone got parsecvs to run to
completion. MozCVS has 1,500 branches.

> all in one file.  It may be the case that the branches are causing
> the delta chains to not be ideal.  I guess I expected slightly
> better but not that much; earlier attempts were around 700 MB so
> I thought maybe you'd be in the 800 MB ballpark.  Under 1 GB is
> still good though as it means it's feasible to fit the damn thing
> into memory on almost any system, which makes it pretty repackable
> with the standard packing code.

I am still missing all of the commits and trees. Don't know how much
they will add yet.

> It's possible that you are also seeing duplicates in the pack;
> I actually wouldn't be surprised if at least 100 MB of that was
> duplicates where the author(s) reverted a file revision to an exact
> prior revision, such that the SHA1 IDs were the same.  fast-import
> (as I have previously said) is stupid and will write the content
> out twice rather than "reuse" the existing entry.
>
> Tonight I'll try to improve fast-import.c to include index
> generation, and at the same time perform duplicate removal.
> That should get you over the GPF in index-pack.c, may reduce disk
> usage a little for the new pack, and save you from having to perform
> a third pass on the new pack.

Sounds like a good plan.

-- 
Jon Smirl
jonsmirl@gmail.com

* Re: git-fast-import
  2006-08-06  2:51 git-fast-import Jon Smirl
@ 2006-08-06  3:40 ` Shawn Pearce
  2006-08-06  4:09   ` git-fast-import Jon Smirl
  0 siblings, 1 reply; 52+ messages in thread
From: Shawn Pearce @ 2006-08-06  3:40 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> git-fast-import works great. I parsed and built my pack file in
> 1:45hr. That's way better than 24hr. I am still IO bound but that
> seems to be an issue with not being able to read ahead 150K small
> files. CPU utilization averages about 50%.

Excellent.  Now if only the damn RCS files were in a more suitable
format.  :-)
 
> I didn't bother reading the sha ids back from fast-import, instead I
> computed them in the python code. Python has a C library function for
> sha1. That decouples the processes from each other. They would run in
> parallel on SMP.

At least you are IO bound and not CPU bound.  But it is silly for the
importer in Python to be computing the SHA1 IDs and for fast-import
to also be computing them.  Would it help if fast-import allowed
you to feed in a tag string which it dumps to an output file listing
SHA1 and the tag?  Then you can feed that data file back into your
tree/commit processing for revision handling.

> My pack file is 980MB compared to 680MB from other attempts. I am
> still missing entries for the trees and commits.

The delta selection ain't the best.  It may be the case that prior
attempts were combining files to get better delta chains vs. staying
all in one file.  It may be the case that the branches are causing
the delta chains to not be ideal.  I guess I expected slightly
better but not that much; earlier attempts were around 700 MB so
I thought maybe you'd be in the 800 MB ballpark.  Under 1 GB is
still good though as it means it's feasible to fit the damn thing
into memory on almost any system, which makes it pretty repackable
with the standard packing code.

It's possible that you are also seeing duplicates in the pack;
I actually wouldn't be surprised if at least 100 MB of that was
duplicates where the author(s) reverted a file revision to an exact
prior revision, such that the SHA1 IDs were the same.  fast-import
(as I have previously said) is stupid and will write the content
out twice rather than "reuse" the existing entry.


Tonight I'll try to improve fast-import.c to include index
generation, and at the same time perform duplicate removal.
That should get you over the GPF in index-pack.c, may reduce disk
usage a little for the new pack, and save you from having to perform
a third pass on the new pack.
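
The duplicate removal described above could look roughly like the
following Python sketch. This is only an illustration of the idea
(remember every SHA-1 already written and skip content seen before),
not the fast-import.c implementation; the class DedupWriter and its
in-memory object list are hypothetical stand-ins for the real pack
writer.

```python
import hashlib

class DedupWriter:
    """Toy object store that writes each distinct blob only once."""

    def __init__(self):
        self.seen = {}      # SHA-1 hex digest -> position in self.objects
        self.objects = []   # stand-in for data appended to the pack

    def add_blob(self, data):
        """Return (sha1, written); written is False for a duplicate."""
        # Git hashes the header "blob <size>\0" followed by the content.
        sha = hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()
        if sha in self.seen:
            return sha, False   # already stored, reuse the existing entry
        self.seen[sha] = len(self.objects)
        self.objects.append(data)
        return sha, True

writer = DedupWriter()
writer.add_blob(b"revision 1\n")
writer.add_blob(b"revision 2\n")
writer.add_blob(b"revision 1\n")   # a revert to the exact earlier content
print(len(writer.objects))         # 2
```

A file reverted to an exact prior revision hashes to the same ID, so
the third call stores nothing new.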

-- 
Shawn.

* git-fast-import
@ 2006-08-06  2:51 Jon Smirl
  2006-08-06  3:40 ` git-fast-import Shawn Pearce
  0 siblings, 1 reply; 52+ messages in thread
From: Jon Smirl @ 2006-08-06  2:51 UTC (permalink / raw)
  To: git, Shawn Pearce

git-fast-import works great. I parsed and built my pack file in
1:45hr. That's way better than 24hr. I am still IO bound but that
seems to be an issue with not being able to read ahead 150K small
files. CPU utilization averages about 50%.

I didn't bother reading the sha ids back from fast-import, instead I
computed them in the python code. Python has a C library function for
sha1. That decouples the processes from each other. They would run in
parallel on SMP.
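
For reference, a blob's object ID can be computed outside of Git along
these lines. This is a minimal sketch, not the importer code mentioned
above (which is not shown in this thread); the function name
git_blob_sha1 is hypothetical. It relies on Git hashing the header
"blob <size>\0" followed by the raw file content.

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_sha1(b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_sha1(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the ID depends only on the content, the importer never has to
wait for fast-import to report it back.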

My pack file is 980MB compared to 680MB from other attempts. I am
still missing entries for the trees and commits.

-- 
Jon Smirl
jonsmirl@gmail.com

Thread overview: 52+ messages
2007-02-06  2:31 git-fast-import Shawn O. Pearce
2007-02-06  3:18 ` git-fast-import Nicolas Pitre
2007-02-06  4:06 ` git-fast-import Nicolas Pitre
2007-02-06  5:48   ` git-fast-import Shawn O. Pearce
2007-02-06 16:35     ` git-fast-import Linus Torvalds
2007-02-06 16:56       ` git-fast-import Shawn O. Pearce
2007-02-06 17:20         ` git-fast-import Linus Torvalds
2007-02-06 18:53           ` git-fast-import Nicolas Pitre
2007-02-06 20:09             ` git-fast-import Shawn O. Pearce
2007-02-06 21:03               ` git-fast-import Nicolas Pitre
2007-02-06 21:15                 ` git-fast-import Shawn O. Pearce
2007-02-06 21:42                   ` git-fast-import Nicolas Pitre
2007-02-07 10:58             ` git-fast-import David Woodhouse
2007-02-06  6:12 ` git-fast-import Aneesh Kumar K.V
2007-02-06  6:18   ` git-fast-import Shawn O. Pearce
2007-02-07  4:55     ` git-fast-import Daniel Barkalow
2007-02-07  9:13       ` git-fast-import Karl Hasselström
2007-02-07 11:17         ` git-fast-import Johannes Schindelin
2007-02-07 22:55           ` git-fast-import Shawn O. Pearce
2007-02-07 23:55             ` git-fast-import Johannes Schindelin
2007-02-08  0:12               ` git-fast-import Shawn O. Pearce
2007-02-08 16:56               ` git-fast-import Linus Torvalds
2007-02-08 19:10                 ` git-fast-import Shawn O. Pearce
2007-02-09  8:49                   ` git-fast-import Karl Hasselström
2007-02-09 15:47                     ` git-fast-import Linus Torvalds
2007-02-07  9:29       ` git-fast-import Raimund Bauer
2007-02-07 13:38       ` git-fast-import David Woodhouse
2007-02-06  9:28 ` git-fast-import Andy Parkins
2007-02-06  9:40   ` git-fast-import Shawn O. Pearce
2007-02-06 16:37   ` git-fast-import Linus Torvalds
2007-02-06 16:44     ` git-fast-import Shawn O. Pearce
2007-02-06 17:24       ` git-fast-import Linus Torvalds
2007-02-07  1:17       ` git-fast-import Horst H. von Brand
2007-02-07  2:50         ` git-fast-import Linus Torvalds
2007-02-07  5:53           ` git-fast-import Shawn O. Pearce
2007-02-07  9:21             ` git-fast-import Karl Hasselström
2007-02-07 22:18             ` git-fast-import Horst H. von Brand
2007-02-07 22:31               ` git-fast-import Jakub Narebski
2007-02-07 22:39               ` git-fast-import Linus Torvalds
2007-02-08 21:34           ` git-fast-import Johannes Schindelin
2007-02-07  5:46         ` git-fast-import Shawn O. Pearce
2007-02-07  4:45       ` git-fast-import Daniel Barkalow
2007-02-06  9:34 ` git-fast-import Jakub Narebski
2007-02-06  9:39   ` git-fast-import Shawn O. Pearce
2007-02-06  9:53 ` git-fast-import Jakub Narebski
2007-02-06 17:20   ` git-fast-import Shawn O. Pearce
2007-02-06 13:50 ` git-fast-import Alex Riesen
2007-02-06 17:43   ` git-fast-import Shawn O. Pearce
2007-02-06 18:02     ` git-fast-import Alex Riesen
  -- strict thread matches above, loose matches on Subject: below --
2006-08-06  2:51 git-fast-import Jon Smirl
2006-08-06  3:40 ` git-fast-import Shawn Pearce
2006-08-06  4:09   ` git-fast-import Jon Smirl
