* [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
@ 2011-06-02  9:28 Arnaud Lacurie
  2011-06-02 17:03 ` Jeff King
  2011-06-02 18:01 ` Junio C Hamano
  0 siblings, 2 replies; 8+ messages in thread
From: Arnaud Lacurie @ 2011-06-02  9:28 UTC (permalink / raw)
  To: git
  Cc: Arnaud Lacurie, Jérémie Nikaes, Claire Fousse,
	David Amouyal, Matthieu Moy, Sylvain Boulmé

Implements a gateway between git and mediawiki, allowing git users to
push and pull objects from mediawiki just as they would with a regular
git repository, thanks to remote-helpers.

Currently supported commands are:
       git clone mediawiki::http://onewiki.com
       git pull

You need the following packages installed (available on common
repositories):
       libmediawiki-api-perl
       libdatetime-format-iso8601-perl

Remote helpers are used so that the process is as transparent as
possible to the git user.

Mediawiki revisions are downloaded through the Mediawiki API and then
fast-imported into git.

Mediawiki revisions and git commits are linked using git notes attached
to the commits.

The import is done on a refs/mediawiki/<remote> branch before being
fetched into refs/remotes/origin/master (huge thanks to Jonathan Nieder
for his help).

For now, the whole wiki is cloned, but it will eventually be possible to
clone only a subset of pages: the clone is driven by a list of pages,
which currently contains every page.

Signed-off-by: Jérémie Nikaes <jeremie.nikaes@ensimag.imag.fr>
Signed-off-by: Arnaud Lacurie <arnaud.lacurie@ensimag.imag.fr>
Signed-off-by: Claire Fousse <claire.fousse@ensimag.imag.fr>
Signed-off-by: David Amouyal <david.amouyal@ensimag.imag.fr>
Signed-off-by: Matthieu Moy <matthieu.moy@grenoble-inp.fr>
Signed-off-by: Sylvain Boulmé <sylvain.boulme@imag.fr>
---
 contrib/mw-to-git/git-remote-mediawiki     |  252 ++++++++++++++++++++++++++++
 contrib/mw-to-git/git-remote-mediawiki.txt |    7 +
 2 files changed, 259 insertions(+), 0 deletions(-)
 create mode 100755 contrib/mw-to-git/git-remote-mediawiki
 create mode 100644 contrib/mw-to-git/git-remote-mediawiki.txt

diff --git a/contrib/mw-to-git/git-remote-mediawiki b/contrib/mw-to-git/git-remote-mediawiki
new file mode 100755
index 0000000..f20edc4
--- /dev/null
+++ b/contrib/mw-to-git/git-remote-mediawiki
@@ -0,0 +1,252 @@
+#! /usr/bin/perl
+
+use strict;
+use Switch;
+use MediaWiki::API;
+use Storable qw(freeze thaw);
+use DateTime::Format::ISO8601;
+use Encode qw(encode_utf8);
+
+my $remotename = $ARGV[0];
+my $url = $ARGV[1];
+
+print STDERR "$url\n";
+
+# commands parser
+my $loop = 1;
+my $entry;
+my @cmd;
+while ($loop) {
+	$| = 1; #flush STDOUT
+	$entry = <STDIN>;
+	print STDERR $entry;
+	chomp($entry);
+	@cmd = undef;
+	@cmd = split(/ /,$entry);
+	switch ($cmd[0]) {
+		case "capabilities" {
+			if ($cmd[1] eq "") {
+				mw_capabilities();
+			} else {
+			       $loop = 0;
+			}
+		}
+		case "list" {
+			if ($cmd[2] eq "") {
+				mw_list($cmd[1]);
+			} else {
+			       $loop = 0;
+			}
+		}
+		case "import" {
+			if ($cmd[1] ne "" && $cmd[2] eq "") {
+				mw_import($url);
+			} else {
+			       $loop = 0;
+			}
+		}
+		case "option" {
+			mw_option($cmd[1],$cmd[2]);
+		}
+		case "push" {
+			#check the pattern +<src>:<dist>
+			my @pushargs = split(/:/,$cmd[1]);
+			if ($pushargs[1] ne "" && $pushargs[2] eq ""
+			&& (substr($pushargs[0],0,1) eq "+")) {
+				mw_push(substr($pushargs[0],1),$pushargs[1]);
+			} else {
+			       $loop = 0;
+			}
+		} else {
+			$loop = 0;
+		}
+	}
+	close(FILE);
+}
+
+########################## Functions ##############################
+
+sub get_last_local_revision {
+	# Get last commit sha1
+	my $commit_sha1 = `git rev-parse refs/mediawiki/$remotename/master 2>/dev/null`;
+
+	# Get note regarding that commit
+	chomp($commit_sha1);
+	my $note = `git notes show $commit_sha1 2>/dev/null`;
+	my @note_info = split(/ /, $note);
+
+	my $lastrevision_number;
+	if (!($note_info[0] eq "mediawiki_revision:")) {
+		print STDERR "No previous mediawiki revision found, fetching from beginning\n";
+		$lastrevision_number = 0;
+	} else {
+		# Notes are formatted : mediawiki_revision: #number
+		$lastrevision_number = $note_info[1];
+		chomp($lastrevision_number);
+		print STDERR "Last mediawiki revision found is $lastrevision_number, fetching from here\n";
+	}
+	return $lastrevision_number;
+}
+
+sub mw_capabilities {
+	# Revisions are imported to the private namespace
+	# refs/mediawiki/$remotename/ by the helper and fetched into
+	# refs/remotes/$remotename later by fetch.
+	print STDOUT "refspec refs/heads/*:refs/mediawiki/$remotename/*\n";
+	print STDOUT "import\n";
+	print STDOUT "list\n";
+	print STDOUT "option\n";
+	print STDOUT "push\n";
+	print STDOUT "\n";
+}
+
+sub mw_list {
+	# MediaWiki does not have branches; we arbitrarily consider one
+	# branch called master
+	print STDOUT "? refs/heads/master\n";
+	print STDOUT '@'."refs/heads/master HEAD\n";
+	print STDOUT "\n";
+
+}
+
+sub mw_option {
+	print STDERR "not yet implemented \n";
+	print STDOUT "unsupported\n";
+}
+
+sub mw_import {
+	my @wiki_name = split(/:\/\//,$url);
+	my $wiki_name = $wiki_name[1];
+
+	my $mediawiki = MediaWiki::API->new;
+	$mediawiki->{config}->{api_url} = "$url/api.php";
+
+	my $pages = $mediawiki->list({
+		action => 'query',
+		list => 'allpages',
+		aplimit => 500,
+	});
+	if ($pages == undef) {
+		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
+		print STDERR "fatal: make sure '$url/api.php' is a valid page\n";
+		exit;
+	}
+
+	my @revisions;
+	print STDERR "Searching revisions...\n";
+	my $fetch_from = get_last_local_revision() + 1;
+	my $n = 1;
+	foreach my $page (@$pages) {
+		my $id = $page->{pageid};
+
+		print STDERR "$n/", scalar(@$pages), ": $page->{title}\n";
+		$n++;
+
+		my $query = {
+			action => 'query',
+			prop => 'revisions',
+			rvprop => 'ids',
+			rvdir => 'newer',
+			rvstartid => $fetch_from,
+			rvlimit => 500,
+			pageids => $page->{pageid},
+		};
+
+		my $revnum = 0;
+		# Get 500 revisions at a time due to the mediawiki api limit
+		while (1) {
+			my $result = $mediawiki->api($query);
+
+			# Parse each of those 500 revisions
+			foreach my $revision (@{$result->{query}->{pages}->{$id}->{revisions}}) {
+				my $page_rev_ids;
+				$page_rev_ids->{pageid} = $page->{pageid};
+				$page_rev_ids->{revid} = $revision->{revid};
+				push (@revisions, $page_rev_ids);
+				$revnum++;
+			}
+			last unless $result->{'query-continue'};
+			$query->{rvstartid} = $result->{'query-continue'}->{revisions}->{rvstartid};
+			print "\n";
+		}
+		print STDERR "  Found ", $revnum, " revision(s).\n";
+	}
+
+	# Creation of the fast-import stream
+	print STDERR "Fetching & writing export data...\n";
+	binmode STDOUT, ':binary';
+	$n = 0;
+
+	foreach my $pagerevids (sort {$a->{revid} <=> $b->{revid}} @revisions) {
+		#fetch the content of the pages
+		my $query = {
+			action => 'query',
+			prop => 'revisions',
+			rvprop => 'content|timestamp|comment|user|ids',
+			revids => $pagerevids->{revid},
+		};
+
+		my $result = $mediawiki->api($query);
+
+		my $rev = pop(@{$result->{query}->{pages}->{$pagerevids->{pageid}}->{revisions}});
+
+		$n++;
+		my $user = $rev->{user} || 'Anonymous';
+		my $dt = DateTime::Format::ISO8601->parse_datetime($rev->{timestamp});
+
+		my $comment = defined $rev->{comment} ? $rev->{comment} : '*Empty MediaWiki Message*';
+		my $title = $result->{query}->{pages}->{$pagerevids->{pageid}}->{title};
+		my $content = $rev->{'*'};
+		$title =~ y/ /_/;
+
+		print STDERR "$n/", scalar(@revisions), ": Revision n°$pagerevids->{revid} of $title\n";
+
+		print "commit refs/mediawiki/$remotename/master\n";
+		print "mark :$n\n";
+		print "committer $user <$user\@$wiki_name> ", $dt->epoch, " +0000\n";
+		print "data ", bytes::length(encode_utf8($comment)), "\n", encode_utf8($comment);
+		# If it's not a clone, needs to know where to start from
+		if ($fetch_from != 1 && $n == 1) {
+			print "from refs/mediawiki/$remotename/master^0\n";
+		}
+		print "M 644 inline $title.wiki\n";
+		print "data ", bytes::length(encode_utf8($content)), "\n", encode_utf8($content);
+		print "\n\n";
+
+
+		# mediawiki revision number in the git note
+		my $note_comment = encode_utf8("note added by git-mediawiki");
+		my $note_comment_length = bytes::length($note_comment);
+		my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
+		my $note_content_length = bytes::length($note_content);
+
+		if ($fetch_from == 1 && $n == 1) {
+			print "reset refs/notes/commits\n";
+		}
+		print "commit refs/notes/commits\n";
+		print "committer $user <user\@example.com> ", $dt->epoch, " +0000\n";
+		print "data ", $note_comment_length, "\n", $note_comment;
+		if ($fetch_from != 1 && $n == 1) {
+			print "from refs/notes/commits^0\n";
+		}
+		print "N inline :$n\n";
+		print "data ", $note_content_length, "\n", $note_content;
+		print "\n\n";
+	}
+
+	if ($fetch_from == 1) {
+		if ($n != 0) {
+			print "reset refs/heads/master\n";
+			print "from :$n\n";
+		} else {
+			print STDERR "You appear to have cloned an empty mediawiki\n";
+			#What do we have to do here ? If nothing is done, an error is thrown saying that
+			#HEAD is refering to unknown object 0000000000000000000
+		}
+	}
+
+}
+
+sub mw_push {
+	print STDERR "not yet implemented \n";
+}
diff --git a/contrib/mw-to-git/git-remote-mediawiki.txt b/contrib/mw-to-git/git-remote-mediawiki.txt
new file mode 100644
index 0000000..4d211f5
--- /dev/null
+++ b/contrib/mw-to-git/git-remote-mediawiki.txt
@@ -0,0 +1,7 @@
+Git-Mediawiki is a project which aims to create a gateway between
+git and mediawiki, allowing git users to push and pull objects from
+mediawiki just as they would with a regular git repository, thanks
+to remote-helpers.
+
+For more information, visit the wiki at
+https://github.com/Bibzball/Git-Mediawiki/wiki
-- 
1.7.5.3.486.gdc5865

* Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
  2011-06-02  9:28 [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled Arnaud Lacurie
@ 2011-06-02 17:03 ` Jeff King
  2011-06-02 20:28   ` Arnaud Lacurie
  2011-06-02 22:37   ` Matthieu Moy
  2011-06-02 18:01 ` Junio C Hamano
  1 sibling, 2 replies; 8+ messages in thread
From: Jeff King @ 2011-06-02 17:03 UTC (permalink / raw)
  To: Arnaud Lacurie
  Cc: git, Jérémie Nikaes, Claire Fousse, David Amouyal,
	Matthieu Moy, Sylvain Boulmé

On Thu, Jun 02, 2011 at 11:28:31AM +0200, Arnaud Lacurie wrote:

> +sub mw_import {
> [...]
> +		# Get 500 revisions at a time due to the mediawiki api limit
> +		while (1) {
> +			my $result = $mediawiki->api($query);
> +
> +			# Parse each of those 500 revisions
> +			foreach my $revision (@{$result->{query}->{pages}->{$id}->{revisions}}) {
> +				my $page_rev_ids;
> +				$page_rev_ids->{pageid} = $page->{pageid};
> +				$page_rev_ids->{revid} = $revision->{revid};
> +				push (@revisions, $page_rev_ids);
> +				$revnum++;
> +			}
> +			last unless $result->{'query-continue'};
> +			$query->{rvstartid} = $result->{'query-continue'}->{revisions}->{rvstartid};
> +			print "\n";
> +		}

What is this newline at the end here for? With it, my import reliably
fails with:

  fatal: Unsupported command: 
  fast-import: dumping crash report to .git/fast_import_crash_6091

Removing it seems to make things work.

> +		my $user = $rev->{user} || 'Anonymous';
> +		my $dt = DateTime::Format::ISO8601->parse_datetime($rev->{timestamp});
> +
> +		my $comment = defined $rev->{comment} ? $rev->{comment} : '*Empty MediaWiki Message*';

In importing the git wiki, I ran into an empty timestamp. This throws an
exception which kills the whole import:

  $ git clone mediawiki::https://git.wiki.kernel.org/ git-wiki
  2821/7949: Revision n°4210 of GitSurvey
  Invalid date format:  at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 195
          main::mw_import('https://git.wiki.kernel.org/') called at /home/peff/compile/git/contrib/mw-to-git/git-remote-mediawiki line 42

At the very least, we should intercept this and put in some placeholder
timestamp. I'm not sure what the best placeholder would be. Maybe use
the date from the previous revision, plus one second? Or maybe there is
some other bug causing us to have an empty timestamp. I didn't dig
deeper yet.
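
Something like this might at least keep the import going (totally
untested; $last_dt here is a made-up variable that would have to carry
the previous revision's date):

    my $dt = eval {
        DateTime::Format::ISO8601->parse_datetime($rev->{timestamp});
    };
    if (!$dt) {
        # fall back: previous revision's date plus one second,
        # or "now" if this is the very first revision
        $dt = $last_dt ? $last_dt->clone->add(seconds => 1)
                       : DateTime->now;
    }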

> +		# mediawiki revision number in the git note
> +		my $note_comment = encode_utf8("note added by git-mediawiki");
> +		my $note_comment_length = bytes::length($note_comment);
> +		my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
> +		my $note_content_length = bytes::length($note_content);
> +
> +		if ($fetch_from == 1 && $n == 1) {
> +			print "reset refs/notes/commits\n";
> +		}
> +		print "commit refs/notes/commits\n";

Should these go in refs/notes/commits? I don't think we have a "best
practices" yet for the notes namespaces, as it is still a relatively new
concept. But I always thought "refs/notes/commits" would be for the
user's "regular" notes, and that programmatic things would get their own
notes, like "refs/notes/mediawiki".

That wouldn't show them by default, but you could do:

  git log --notes=mediawiki

to see them (and maybe that is a feature, because most of the time you
won't care about the mediawiki revision).
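
In the fast-import stream that would just mean pointing at a different
ref, e.g. (just a sketch, everything else unchanged):

    if ($fetch_from == 1 && $n == 1) {
        print "reset refs/notes/mediawiki\n";
    }
    print "commit refs/notes/mediawiki\n";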

> +		} else {
> +			print STDERR "You appear to have cloned an empty mediawiki\n";
> +			#What do we have to do here ? If nothing is done, an error is thrown saying that
> +			#HEAD is refering to unknown object 0000000000000000000
> +		}

Hmm. We do allow cloning empty git repos. It might be nice for there to
be some way for a remote helper to signal "everything OK, but the result
is empty". But I think that is probably something that needs to be added
to the remote-helper protocol, and so is outside the scope of your
script (maybe it is as simple as interpreting the null sha1 as "empty";
I dunno).

Overall, it's looking pretty good. I like that I can resume a
half-finished import via "git fetch". Though I do have one complaint:
running "git fetch" fetches the metainfo for every revision of every
page, just as it does for an initial clone. Is there something in the
mediawiki API to say "show me revisions since N" (where N would be the
mediawiki revision of the tip of what we imported)?

-Peff

* Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
  2011-06-02  9:28 [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled Arnaud Lacurie
  2011-06-02 17:03 ` Jeff King
@ 2011-06-02 18:01 ` Junio C Hamano
  2011-06-02 20:58   ` Arnaud Lacurie
  1 sibling, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2011-06-02 18:01 UTC (permalink / raw)
  To: Arnaud Lacurie
  Cc: git, Jérémie Nikaes, Claire Fousse, David Amouyal,
	Matthieu Moy, Sylvain Boulmé

Arnaud Lacurie <arnaud.lacurie@ensimag.imag.fr> writes:

>  contrib/mw-to-git/git-remote-mediawiki     |  252 ++++++++++++++++++++++++++++
>  contrib/mw-to-git/git-remote-mediawiki.txt |    7 +
>  2 files changed, 259 insertions(+), 0 deletions(-)

It is pleasing to see that a half of a custom backend can be done in just
250 lines of code.  I understand that this is a work-in-progress with many
unnecessary lines spitting debugging output to STDERR, whose removal will
further shrink the code?

> +# commands parser
> +my $loop = 1;
> +my $entry;
> +my @cmd;
> +while ($loop) {

This is somewhat unusual-looking loop control.

Wouldn't "while (1) { ...; last if (...); if (...) { last; } }" do?

> +	$| = 1; #flush STDOUT
> +	$entry = <STDIN>;
> +	print STDERR $entry;
> +	chomp($entry);
> +	@cmd = undef;
> +	@cmd = split(/ /,$entry);
> +	switch ($cmd[0]) {
> +		case "capabilities" {
> +			if ($cmd[1] eq "") {
> +				mw_capabilities();
> +			} else {
> +			       $loop = 0;

I presume that this is "We were expecting to read capabilities command but
found something unexpected; let's abort". Don't you want to say something
to the user here, perhaps on STDERR?
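
Perhaps something as simple as (wording up to you, of course):

    print STDERR "Unexpected arguments to 'capabilities', aborting\n";
    $loop = 0;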

> +			}
> +		}
> ...
> +		case "option" {
> +			mw_option($cmd[1],$cmd[2]);
> +		}

No error checking only for this one?

> +		case "push" {
> +			#check the pattern +<src>:<dist>

The latter one is usually spelled <dst> standing for "destination".

> +			my @pushargs = split(/:/,$cmd[1]);
> +			if ($pushargs[1] ne "" && $pushargs[2] eq ""
> +			&& (substr($pushargs[0],0,1) eq "+")) {
> +				mw_push(substr($pushargs[0],1),$pushargs[1]);
> +			} else {
> +			       $loop = 0;
> +			}

Is "push" always forcing?

> +sub mw_import {
> +	my @wiki_name = split(/:\/\//,$url);
> +	my $wiki_name = $wiki_name[1];
> +
> +	my $mediawiki = MediaWiki::API->new;
> +	$mediawiki->{config}->{api_url} = "$url/api.php";
> +
> +	my $pages = $mediawiki->list({
> +		action => 'query',
> +		list => 'allpages',
> +		aplimit => 500,
> +	});
> +	if ($pages == undef) {
> +		print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
> +		print STDERR "fatal: make sure '$url/api.php' is a valid page\n";
> +		exit;
> +	}
> +
> +	my @revisions;
> +	print STDERR "Searching revisions...\n";
> +	my $fetch_from = get_last_local_revision() + 1;
> +	my $n = 1;
> +	foreach my $page (@$pages) {
> +		my $id = $page->{pageid};
> +
> +		print STDERR "$n/", scalar(@$pages), ": $page->{title}\n";
> +		$n++;
> +
> +		my $query = {
> +			action => 'query',
> +			prop => 'revisions',
> +			rvprop => 'ids',
> +			rvdir => 'newer',
> +			rvstartid => $fetch_from,
> +			rvlimit => 500,
> +			pageids => $page->{pageid},
> +		};
> +
> +		my $revnum = 0;
> +		# Get 500 revisions at a time due to the mediawiki api limit

It's nice that you can dig deeper with rvlimit increments. I wonder if
'allpages' also lets you retrieve more than 500 pages in total by somehow
iterating over the set of pages.
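
Going by the API documentation, the continuation for 'allpages' would
look much like what is already done for revisions, using 'apfrom' from
'query-continue' (untested sketch):

    my @pages;
    my $apfrom = "";
    while (defined $apfrom) {
        my $result = $mediawiki->api({
            action => 'query',
            list => 'allpages',
            aplimit => 500,
            apfrom => $apfrom,
        });
        push @pages, @{$result->{query}->{allpages}};
        $apfrom = $result->{'query-continue'}->{allpages}->{apfrom};
    }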

> +	# Creation of the fast-import stream
> +	print STDERR "Fetching & writing export data...\n";
> +	binmode STDOUT, ':binary';
> +	$n = 0;
> +
> +	foreach my $pagerevids (sort {$a->{revid} <=> $b->{revid}} @revisions) {
> +		#fetch the content of the pages
> +		my $query = {
> +			action => 'query',
> +			prop => 'revisions',
> +			rvprop => 'content|timestamp|comment|user|ids',
> +			revids => $pagerevids->{revid},
> +		};
> +
> +		my $result = $mediawiki->api($query);
> +
> +		my $rev = pop(@{$result->{query}->{pages}->{$pagerevids->{pageid}}->{revisions}});

Is the list of per-page revisions guaranteed to be sorted (not a
rhetorical question; just asking)?

> +		print "commit refs/mediawiki/$remotename/master\n";
> +		print "mark :$n\n";
> +		print "committer $user <$user\@$wiki_name> ", $dt->epoch, " +0000\n";
> +		print "data ", bytes::length(encode_utf8($comment)), "\n", encode_utf8($comment);

Calling encode_utf8() twice on the same data?  How big is this $comment
typically?  Or does encode_utf8() somehow memoize?

> +		# If it's not a clone, needs to know where to start from
> +		if ($fetch_from != 1 && $n == 1) {
> +			print "from refs/mediawiki/$remotename/master^0\n";
> +		}
> +		print "M 644 inline $title.wiki\n";
> +		print "data ", bytes::length(encode_utf8($content)), "\n", encode_utf8($content);

Same for $content, which presumably is larger than $comment...

Perhaps a small helper

	sub literal_data {
        	my ($content) = @_;
                print "data ", bytes::length($content), "\n", $content;
	}

would help here, above, and below where you create a "note" on this
commit?

> +		# mediawiki revision number in the git note
> +		my $note_comment = encode_utf8("note added by git-mediawiki");
> +		my $note_comment_length = bytes::length($note_comment);
> +		my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
> +		my $note_content_length = bytes::length($note_content);
> +
> +		if ($fetch_from == 1 && $n == 1) {
> +			print "reset refs/notes/commits\n";
> +		}
> +		print "commit refs/notes/commits\n";
> +		print "committer $user <user\@example.com> ", $dt->epoch, " +0000\n";
> +		print "data ", $note_comment_length, "\n", $note_comment;

With that, this will become

	literal_data(encode_utf8("note added by git-mediawiki"));

and you don't need two extra variables.  Same for $note_content*.

* Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
  2011-06-02 17:03 ` Jeff King
@ 2011-06-02 20:28   ` Arnaud Lacurie
  2011-06-02 22:49     ` Jeff King
  2011-06-02 22:37   ` Matthieu Moy
  1 sibling, 1 reply; 8+ messages in thread
From: Arnaud Lacurie @ 2011-06-02 20:28 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Jérémie Nikaes, Claire Fousse, David Amouyal,
	Matthieu Moy, Sylvain Boulmé

2011/6/2 Jeff King <peff@peff.net>:
> On Thu, Jun 02, 2011 at 11:28:31AM +0200, Arnaud Lacurie wrote:
>
>> +sub mw_import {
>> [...]
>> +             # Get 500 revisions at a time due to the mediawiki api limit
>> +             while (1) {
>> +                     my $result = $mediawiki->api($query);
>> +
>> +                     # Parse each of those 500 revisions
>> +                     foreach my $revision (@{$result->{query}->{pages}->{$id}->{revisions}}) {
>> +                             my $page_rev_ids;
>> +                             $page_rev_ids->{pageid} = $page->{pageid};
>> +                             $page_rev_ids->{revid} = $revision->{revid};
>> +                             push (@revisions, $page_rev_ids);
>> +                             $revnum++;
>> +                     }
>> +                     last unless $result->{'query-continue'};
>> +                     $query->{rvstartid} = $result->{'query-continue'}->{revisions}->{rvstartid};
>> +                     print "\n";
>> +             }
>
> What is this newline at the end here for? With it, my import reliably
> fails with:
>
>  fatal: Unsupported command:
>  fast-import: dumping crash report to .git/fast_import_crash_6091
>
> Removing it seems to make things work.

Yes, we actually found it today. It slipped through because we had never
fetched pages with more than 500 revisions since that code went in...

>> +             # mediawiki revision number in the git note
>> +             my $note_comment = encode_utf8("note added by git-mediawiki");
>> +             my $note_comment_length = bytes::length($note_comment);
>> +             my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
>> +             my $note_content_length = bytes::length($note_content);
>> +
>> +             if ($fetch_from == 1 && $n == 1) {
>> +                     print "reset refs/notes/commits\n";
>> +             }
>> +             print "commit refs/notes/commits\n";
>
> Should these go in refs/notes/commits? I don't think we have a "best
> practices" yet for the notes namespaces, as it is still a relatively new
> concept. But I always thought "refs/notes/commits" would be for the
> user's "regular" notes, and that programmatic things would get their own
> notes, like "refs/notes/mediawiki".
>
That's a good idea; we didn't realize notes could go anywhere other than
refs/notes/commits. This will be perfect for distinguishing the user's
notes from ours.
>
>> +             } else {
>> +                     print STDERR "You appear to have cloned an empty mediawiki\n";
>> +                     #What do we have to do here ? If nothing is done, an error is thrown saying that
>> +                     #HEAD is refering to unknown object 0000000000000000000
>> +             }
>
> Hmm. We do allow cloning empty git repos. It might be nice for there to
> be some way for a remote helper to signal "everything OK, but the result
> is empty". But I think that is probably something that needs to be added
> to the remote-helper protocol, and so is outside the scope of your
> script (maybe it is as simple as interpreting the null sha1 as "empty";
> I dunno).
>

Yes, that's a problem we've been running into. We didn't really know
how to solve it.

> Overall, it's looking pretty good. I like that I can resume a
> half-finished import via "git fetch". Though I do have one complaint:
> running "git fetch" fetches the metainfo for every revision of every
> page, just as it does for an initial clone. Is there something in the
> mediawiki API to say "show me revisions since N" (where N would be the
> mediawiki revision of the tip of what we imported)?

I am not sure I understand your question, because we actually do support
this, thanks to git notes: when you run git fetch after a clone, it only
looks at the revisions newer than the last one we imported.

Thank you very much for your help!

Arnaud Lacurie

* Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
  2011-06-02 18:01 ` Junio C Hamano
@ 2011-06-02 20:58   ` Arnaud Lacurie
  0 siblings, 0 replies; 8+ messages in thread
From: Arnaud Lacurie @ 2011-06-02 20:58 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Jérémie Nikaes, Claire Fousse, David Amouyal,
	Matthieu Moy, Sylvain Boulmé

2011/6/2 Junio C Hamano <gitster@pobox.com>:
> Arnaud Lacurie <arnaud.lacurie@ensimag.imag.fr> writes:
>
>>  contrib/mw-to-git/git-remote-mediawiki     |  252 ++++++++++++++++++++++++++++
>>  contrib/mw-to-git/git-remote-mediawiki.txt |    7 +
>>  2 files changed, 259 insertions(+), 0 deletions(-)
>
> It is pleasing to see that a half of a custom backend can be done in just
> 250 lines of code.  I understand that this is a work-in-progress with many
> unnecessary lines spitting debugging output to STDERR, whose removal will
> further shrink the code?

To give you an idea, the whole thing including the other part is about
400 lines for now. There is a lot of output to STDERR that will
eventually be removed. We still have to decide what "option verbosity"
should display before removing too many of them.


>> +# commands parser
>> +my $loop = 1;
>> +my $entry;
>> +my @cmd;
>> +while ($loop) {
>
> This is somewhat unusual-looking loop control.
>
> Wouldn't "while (1) { ...; last if (...); if (...) { last; } }" do?

Ok for that loop control.

>> +     $| = 1; #flush STDOUT
>> +     $entry = <STDIN>;
>> +     print STDERR $entry;
>> +     chomp($entry);
>> +     @cmd = undef;
>> +     @cmd = split(/ /,$entry);
>> +     switch ($cmd[0]) {
>> +             case "capabilities" {
>> +                     if ($cmd[1] eq "") {
>> +                             mw_capabilities();
>> +                     } else {
>> +                            $loop = 0;
>
> I presume that this is "We were expecting to read capabilities command but
> found something unexpected; let's abort". Don't you want to say something
> to the user here, perhaps on STDERR?
>

Actually, we based that "no error messages displayed" policy on
git-remote-http, which aborts if something unexpected is found. We could
add an error message though.

>> +                     }
>> +             }
>> ...
>> +             case "option" {
>> +                     mw_option($cmd[1],$cmd[2]);
>> +             }
>
> No error checking only for this one?

There should be one. That's an oversight on our part.



>> +                     my @pushargs = split(/:/,$cmd[1]);
>> +                     if ($pushargs[1] ne "" && $pushargs[2] eq ""
>> +                     && (substr($pushargs[0],0,1) eq "+")) {
>> +                             mw_push(substr($pushargs[0],1),$pushargs[1]);
>> +                     } else {
>> +                            $loop = 0;
>> +                     }
>
> Is "push" always forcing?

The tests are done in the mw_push function, which is still being
implemented. The push will abort if there is something to pull, just as
git does.
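
Roughly, the intent is something like this (not the real code; the
get_last_remote_revision() helper is made up and still to be written):

    sub mw_push {
        my ($src, $dst) = @_;
        # refuse the push if the wiki has revisions we have not imported
        if (get_last_remote_revision() > get_last_local_revision()) {
            print STDERR "Remote wiki has new revisions, pull first\n";
            print STDOUT "error $dst non-fast-forward\n\n";
            return;
        }
        # otherwise turn each commit in $src into a wiki edit, then
        # report success for this ref
        print STDOUT "ok $dst\n\n";
    }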

>> +sub mw_import {
>> +     my @wiki_name = split(/:\/\//,$url);
>> +     my $wiki_name = $wiki_name[1];
>> +
>> +     my $mediawiki = MediaWiki::API->new;
>> +     $mediawiki->{config}->{api_url} = "$url/api.php";
>> +
>> +     my $pages = $mediawiki->list({
>> +             action => 'query',
>> +             list => 'allpages',
>> +             aplimit => 500,
>> +     });
>> +     if ($pages == undef) {
>> +             print STDERR "fatal: '$url' does not appear to be a mediawiki\n";
>> +             print STDERR "fatal: make sure '$url/api.php' is a valid page\n";
>> +             exit;
>> +     }
>> +
>> +     my @revisions;
>> +     print STDERR "Searching revisions...\n";
>> +     my $fetch_from = get_last_local_revision() + 1;
>> +     my $n = 1;
>> +     foreach my $page (@$pages) {
>> +             my $id = $page->{pageid};
>> +
>> +             print STDERR "$n/", scalar(@$pages), ": $page->{title}\n";
>> +             $n++;
>> +
>> +             my $query = {
>> +                     action => 'query',
>> +                     prop => 'revisions',
>> +                     rvprop => 'ids',
>> +                     rvdir => 'newer',
>> +                     rvstartid => $fetch_from,
>> +                     rvlimit => 500,
>> +                     pageids => $page->{pageid},
>> +             };
>> +
>> +             my $revnum = 0;
>> +             # Get 500 revisions at a time due to the mediawiki api limit
>
> It's nice that you can dig deeper with rvlimit increments. I wonder if
> 'allpages' also lets you retrieve more than 500 pages in total by somehow
> iterating over the set of pages.

Yes, it is possible, and it works once the "\n" Peff pointed out in the
previous message is removed.

>> +     # Creation of the fast-import stream
>> +     print STDERR "Fetching & writing export data...\n";
>> +     binmode STDOUT, ':binary';
>> +     $n = 0;
>> +
>> +     foreach my $pagerevids (sort {$a->{revid} <=> $b->{revid}} @revisions) {
>> +             #fetch the content of the pages
>> +             my $query = {
>> +                     action => 'query',
>> +                     prop => 'revisions',
>> +                     rvprop => 'content|timestamp|comment|user|ids',
>> +                     revids => $pagerevids->{revid},
>> +             };
>> +
>> +             my $result = $mediawiki->api($query);
>> +
>> +             my $rev = pop(@{$result->{query}->{pages}->{$pagerevids->{pageid}}->{revisions}});
>
> Is the list of per-page revisions guaranteed to be sorted (not a
> rhetorical question; just asking)?

Yes it is.


>> +             print "commit refs/mediawiki/$remotename/master\n";
>> +             print "mark :$n\n";
>> +             print "committer $user <$user\@$wiki_name> ", $dt->epoch, " +0000\n";
>> +             print "data ", bytes::length(encode_utf8($comment)), "\n", encode_utf8($comment);
>
> Calling encode_utf8() twice on the same data?  How big is this $comment
> typically?  Or does encode_utf8() somehow memoize?
>
>> +             # If it's not a clone, needs to know where to start from
>> +             if ($fetch_from != 1 && $n == 1) {
>> +                     print "from refs/mediawiki/$remotename/master^0\n";
>> +             }
>> +             print "M 644 inline $title.wiki\n";
>> +             print "data ", bytes::length(encode_utf8($content)), "\n", encode_utf8($content);
>
> Same for $content, which presumably is larger than $comment...
>
> Perhaps a small helper
>
>        sub literal_data {
>                my ($content) = @_;
>                print "data ", bytes::length($content), "\n", $content;
>        }
>
> would help here, above, and below where you create a "note" on this
> commit?
>
>> +             # mediawiki revision number in the git note
>> +             my $note_comment = encode_utf8("note added by git-mediawiki");
>> +             my $note_comment_length = bytes::length($note_comment);
>> +             my $note_content = encode_utf8("mediawiki_revision: " . $pagerevids->{revid} . "\n");
>> +             my $note_content_length = bytes::length($note_content);
>> +
>> +             if ($fetch_from == 1 && $n == 1) {
>> +                     print "reset refs/notes/commits\n";
>> +             }
>> +             print "commit refs/notes/commits\n";
>> +             print "committer $user <user\@example.com> ", $dt->epoch, " +0000\n";
>> +             print "data ", $note_comment_length, "\n", $note_comment;
>
> With that, this will become
>
>        literal_data(encode_utf8("note added by git-mediawiki"));
>
> and you don't need two extra variables.  Same for $note_content*.

Thank you very much for this helper; it'll help factor the code and
reduce the number of variables.


Thank you very much for your help and comments on that project.

Regards

-- 
Arnaud Lacurie

* Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
  2011-06-02 17:03 ` Jeff King
  2011-06-02 20:28   ` Arnaud Lacurie
@ 2011-06-02 22:37   ` Matthieu Moy
  2011-06-03  3:43     ` Jeff King
  1 sibling, 1 reply; 8+ messages in thread
From: Matthieu Moy @ 2011-06-02 22:37 UTC (permalink / raw)
  To: Jeff King
  Cc: Arnaud Lacurie, git, Jérémie Nikaes, Claire Fousse,
	David Amouyal, Sylvain Boulmé

Jeff King <peff@peff.net> writes:

> Overall, it's looking pretty good. I like that I can resume a
> half-finished import via "git fetch". Though I do have one complaint:
> running "git fetch" fetches the metainfo for every revision of every
> page, just as it does for an initial clone. Is there something in the
> mediawiki API to say "show me revisions since N" (where N would be the
> mediawiki revision of the tip of what we imported)?

The idea is that we ultimately want to be able to import a subset of a
large wiki. In Wikipedia, for example, "show me revisions since N" will
be very large after a few minutes. OTOH, "show me revisions touching the
few pages I'm following" should be fast. And at least, it's O(imported
wiki size), not O(complete wiki size)

Ideally, there could be heuristics like

"show me how many revisions since N"
if (not many) {
    "OK, show me them all in details"
} else {
    "hmm, we'll do it another way, show me revisions touching my pages"
}

but let's not be too ambitious for now: it's a student's project,
completing one week from now, and the goal is to have something clean
and extensible. Bells and whistles will come later ;-).

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/

* Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
  2011-06-02 20:28   ` Arnaud Lacurie
@ 2011-06-02 22:49     ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2011-06-02 22:49 UTC (permalink / raw)
  To: Arnaud Lacurie
  Cc: git, Jérémie Nikaes, Claire Fousse, David Amouyal,
	Matthieu Moy, Sylvain Boulmé

On Thu, Jun 02, 2011 at 10:28:33PM +0200, Arnaud Lacurie wrote:

> > Overall, it's looking pretty good. I like that I can resume a
> > half-finished import via "git fetch". Though I do have one complaint:
> > running "git fetch" fetches the metainfo for every revision of every
> > page, just as it does for an initial clone. Is there something in the
> > mediawiki API to say "show me revisions since N" (where N would be the
> > mediawiki revision of the tip of what we imported)?
> 
> I am not sure I understand your question, because we actually do support
> this, thanks to git notes: when you run git fetch after a clone, it only
> looks at the revisions newer than the last one we imported.

Sorry, I was partially wrong in what I wrote above. I was resuming a
failed import (because of the bogus timestamp), so the numbers of
revisions were still high, and I thought they were the same as in the
original version. I see now doing a fetch on the completed import that
it properly finds 0 revisions for each page. So that's good.

However, it does still take O(number of pages) requests just to find out
that there is nothing to fetch. For the git wiki, this takes on the
order of 1.5 minutes to do an empty fetch. When getting the list of
pages, I wonder if there is a way in the mediawiki API to say "show me
only pages which have been modified since rev N".
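
Maybe list=recentchanges could be used for that? Untested, just from
skimming the API docs; $last_timestamp would be whatever timestamp we
recorded for the previously imported revision:

    my $changes = $mediawiki->list({
        action => 'query',
        list => 'recentchanges',
        rcdir => 'newer',
        rcstart => $last_timestamp,
        rcprop => 'title|ids|timestamp',
        rclimit => 500,
    });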

-Peff

* Re: [RFC/PATCH] Added a remote helper to interact with mediawiki, pull & clone handled
  2011-06-02 22:37   ` Matthieu Moy
@ 2011-06-03  3:43     ` Jeff King
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff King @ 2011-06-03  3:43 UTC (permalink / raw)
  To: Matthieu Moy
  Cc: Arnaud Lacurie, git, Jérémie Nikaes, Claire Fousse,
	David Amouyal, Sylvain Boulmé

On Fri, Jun 03, 2011 at 12:37:04AM +0200, Matthieu Moy wrote:

> The idea is that we ultimately want to be able to import a subset of a
> large wiki. In Wikipedia, for example, "show me revisions since N" will
> be very large after a few minutes. OTOH, "show me revisions touching the
> few pages I'm following" should be fast. And at least, it's O(imported
> wiki size), not O(complete wiki size)

Yeah, I think what you want to do is dependent on wiki size. For a small
wiki, it doesn't matter; all pages is not much. For a large wiki, you
want a subset of the pages, and you _never_ want to do any operations on
the whole page space. In the middle are medium-sized wikis, where you
would like look at the whole page space, but ideally not in O(number of
pages).

But the point is somewhat moot, because having just read through the
mediawiki API, I've come to the conclusion (which seems familiar
from the last time I looked at this problem) that there is no way to ask
for what I want in a single query. That is, to say "show me all
revisions of all pages matching some subset X, that have been modified
since revision N". Or even "show me all pages matching some subset X
that have been modified since revision N", and then we could at least
cull the pages that haven't been touched.

But AFAICT, none of those is possible. I think we are stuck asking for
each page's information individually (you can even query multiple pages'
revision information simultaneously, but you can get only a single
revision from each in that case). There's not even a way to say "get me
the latest revision number for all of these pages".

One thing we could do to reduce the total run-time is to issue several
queries in parallel so that the query latency isn't so prevalent. I
don't know what a good level of parallelism is for a server like
wikipedia, though. I'm sure they don't appreciate users hammering the
servers too hard. Ideally you want just enough queries outstanding that
the remote server is always working on _one_, and the rest are doing
something else (traveling across the network, local processing and
storage, etc). But I'm not sure of a good way to measure that.

> but let's not be too ambitious for now: it's a student's project,
> completing one week from now, and the goal is to have something clean
> and extensible. Bells and whistles will come later ;-).

Yes. I think all of this is outside the scope of a student project. I
just like to dream. :)

-Peff
