From: "Eric Sunshine via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "Jeff King" <peff@peff.net>, "Elijah Newren" <newren@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Fabian Stelzer" <fs@gigacodes.de>,
	"Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Eric Sunshine" <sunshine@sunshineco.com>
Subject: [PATCH 02/18] chainlint.pl: add POSIX shell lexical analyzer
Date: Thu, 01 Sep 2022 00:29:40 +0000
Message-ID: <c1042b9bcd94b9ecb0bf73dfbd4334b9f30ba99a.1661992197.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1322.git.git.1661992197.gitgitgadget@gmail.com>

From: Eric Sunshine <sunshine@sunshineco.com>

Begin fleshing out chainlint.pl by adding a lexical analyzer for the
POSIX shell command language. The sole entry point Lexer::scan_token()
returns the next token from the input. It will be called by the upcoming
shell language parser.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
---
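
Not part of the patch: the token-recognition quirks described in the Lexer
comment can be observed in any POSIX shell by counting the words the shell
itself produces. This is only a sketch; `count_words` is a hypothetical
helper introduced for illustration and does not exist in chainlint.pl. It
assumes a non-interactive script, since an interactive shell with history
expansion enabled may treat `!` differently.

```shell
#!/bin/sh
# count_words (hypothetical helper): print how many words the shell
# passed as arguments after its own token recognition.
count_words() {
    echo "$#"
}

count_words 1+2                      # prints 1: no whitespace, one token
count_words 1 + 2                    # prints 3: whitespace separates tokens
count_words foo!uucp                 # prints 1: `!` is an ordinary character here
count_words "string"/foo:'string'    # prints 1: quoted and unquoted parts join
```

The last call receives the single word `string/foo:string`, because adjacent
quoted and unquoted segments belong to the same token.
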
 t/chainlint.pl | 177 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 177 insertions(+)
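
Not part of the patch: a minimal shell sketch of the here-doc ordering rule
the Lexer comment describes, in which bodies follow the newline in the same
left-to-right order as their `<<TAG` operators. The function names are
hypothetical and chosen only for this illustration. The second function
shows why a quoted tag is matched without its quotes: the closing line is
the bare tag.

```shell
#!/bin/sh
# Two here-doc operators on one line: the body for "A" comes first,
# then the body for "B", matching the order of the operators.
two_heredocs() {
    cat <<A && cat <<B
body of A
A
body of B
B
}

# A quoted tag disables expansion in the body; the terminating line is
# still the unquoted tag "EOF".
quoted_tag() {
    cat <<'EOF'
$HOME is not expanded
EOF
}

two_heredocs
quoted_tag
```

Running the script prints the body of "A", then the body of "B", then the
literal text `$HOME is not expanded`.
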

diff --git a/t/chainlint.pl b/t/chainlint.pl
index e8ab95c7858..81ffbf28bf3 100755
--- a/t/chainlint.pl
+++ b/t/chainlint.pl
@@ -21,6 +21,183 @@ use Getopt::Long;
 my $show_stats;
 my $emit_all;
 
+# Lexer tokenizes POSIX shell scripts. It is roughly modeled after section 2.3
+# "Token Recognition" of POSIX chapter 2 "Shell Command Language". Although
+# similar to lexical analyzers for other languages, this one differs in a few
+# substantial ways due to quirks of the shell command language.
+#
+# For instance, in many languages, newline is just whitespace like space or
+# TAB, but in shell a newline is a command separator, thus a distinct lexical
+# token. A newline is significant and returned as a distinct token even at the
+# end of a shell comment.
+#
+# In other languages, `1+2` would typically be scanned as three tokens
+# (`1`, `+`, and `2`), but in shell it is a single token. However, the similar
+# `1 + 2`, which embeds whitespace, is scanned as three tokens in shell, as well.
+# In shell, several characters with special meaning lose that meaning when not
+# surrounded by whitespace. For instance, the negation operator `!` is special
+# when standing alone surrounded by whitespace; whereas in `foo!uucp` it is
+# just a plain character in the longer token "foo!uucp". In many other
+# languages, `"string"/foo:'string'` might be scanned as five tokens ("string",
+# `/`, `foo`, `:`, and 'string'), but in shell, it is just a single token.
+#
+# The lexical analyzer for the shell command language is also somewhat unusual
+# in that it recursively invokes the parser to handle the body of `$(...)`
+# expressions which can contain arbitrary shell code. Such expressions may be
+# encountered both inside and outside of double-quoted strings.
+#
+# The lexical analyzer is responsible for consuming shell here-doc bodies which
+# extend from the line following a `<<TAG` operator until a line consisting
+# solely of `TAG`. Here-doc consumption begins when a newline is encountered.
+# It is legal for multiple here-doc `<<TAG` operators to be present on a single
+# line, in which case their bodies must be present one following the next, and
+# are consumed in the (left-to-right) order the `<<TAG` operators appear on the
+# line. A special complication is that the bodies of all here-docs must be
+# consumed when the newline is encountered even if the parse context depth has
+# changed. For instance, in `cat <<A && x=$(cat <<B &&\n`, bodies of here-docs
+# "A" and "B" must be consumed even though "A" was introduced outside the
+# recursive parse context in which "B" was introduced and in which the newline
+# is encountered.
+package Lexer;
+
+sub new {
+	my ($class, $parser, $s) = @_;
+	bless {
+		parser => $parser,
+		buff => $s,
+		heretags => []
+	} => $class;
+}
+
+sub scan_heredoc_tag {
+	my $self = shift @_;
+	${$self->{buff}} =~ /\G(-?)/gc;
+	my $indented = $1;
+	my $tag = $self->scan_token();
+	$tag =~ s/['"\\]//g;
+	push(@{$self->{heretags}}, $indented ? "\t$tag" : "$tag");
+	return "<<$indented$tag";
+}
+
+sub scan_op {
+	my ($self, $c) = @_;
+	my $b = $self->{buff};
+	return $c unless $$b =~ /\G(.)/sgc;
+	my $cc = $c . $1;
+	return scan_heredoc_tag($self) if $cc eq '<<';
+	return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;
+	pos($$b)--;
+	return $c;
+}
+
+sub scan_sqstring {
+	my $self = shift @_;
+	${$self->{buff}} =~ /\G([^']*'|.*\z)/sgc;
+	return "'" . $1;
+}
+
+sub scan_dqstring {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	my $s = '"';
+	while (1) {
+		# slurp up non-special characters
+		$s .= $1 if $$b =~ /\G([^"\$\\]+)/gc;
+		# handle special characters
+		last unless $$b =~ /\G(.)/sgc;
+		my $c = $1;
+		$s .= '"', last if $c eq '"';
+		$s .= '$' . $self->scan_dollar(), next if $c eq '$';
+		if ($c eq '\\') {
+			$s .= '\\', last unless $$b =~ /\G(.)/sgc;
+			$c = $1;
+			next if $c eq "\n"; # line splice
+			# backslash escapes only $, `, ", \ in dq-string
+			$s .= '\\' unless $c =~ /^[\$`"\\]$/;
+			$s .= $c;
+			next;
+		}
+		die("internal error scanning dq-string '$c'\n");
+	}
+	return $s;
+}
+
+sub scan_balanced {
+	my ($self, $c1, $c2) = @_;
+	my $b = $self->{buff};
+	my $depth = 1;
+	my $s = $c1;
+	while ($$b =~ /\G([^\Q$c1$c2\E]*(?:[\Q$c1$c2\E]|\z))/gc) {
+		$s .= $1;
+		$depth++, next if $s =~ /\Q$c1\E$/;
+		$depth--;
+		last if $depth == 0;
+	}
+	return $s;
+}
+
+sub scan_subst {
+	my $self = shift @_;
+	my @tokens = $self->{parser}->parse(qr/^\)$/);
+	$self->{parser}->next_token(); # closing ")"
+	return @tokens;
+}
+
+sub scan_dollar {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	return $self->scan_balanced('(', ')') if $$b =~ /\G\((?=\()/gc; # $((...))
+	return '(' . join(' ', $self->scan_subst()) . ')' if $$b =~ /\G\(/gc; # $(...)
+	return $self->scan_balanced('{', '}') if $$b =~ /\G\{/gc; # ${...}
+	return $1 if $$b =~ /\G(\w+)/gc; # $var
+	return $1 if $$b =~ /\G([@*#?$!0-9-])/gc; # $*, $1, $$, etc.
+	return '';
+}
+
+sub swallow_heredocs {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	my $tags = $self->{heretags};
+	while (my $tag = shift @$tags) {
+		my $indent = $tag =~ s/^\t// ? '\\s*' : '';
+		$$b =~ /(?:\G|\n)$indent\Q$tag\E(?:\n|\z)/gc;
+	}
+}
+
+sub scan_token {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	my $token = '';
+RESTART:
+	$$b =~ /\G[ \t]+/gc; # skip whitespace (but not newline)
+	return "\n" if $$b =~ /\G#[^\n]*(?:\n|\z)/gc; # comment
+	while (1) {
+		# slurp up non-special characters
+		$token .= $1 if $$b =~ /\G([^\\;&|<>(){}'"\$\s]+)/gc;
+		# handle special characters
+		last unless $$b =~ /\G(.)/sgc;
+		my $c = $1;
+		last if $c =~ /^[ \t]$/; # whitespace ends token
+		pos($$b)--, last if length($token) && $c =~ /^[;&|<>(){}\n]$/;
+		$token .= $self->scan_sqstring(), next if $c eq "'";
+		$token .= $self->scan_dqstring(), next if $c eq '"';
+		$token .= $c . $self->scan_dollar(), next if $c eq '$';
+		$self->swallow_heredocs(), $token = $c, last if $c eq "\n";
+		$token = $self->scan_op($c), last if $c =~ /^[;&|<>]$/;
+		$token = $c, last if $c =~ /^[(){}]$/;
+		if ($c eq '\\') {
+			$token .= '\\', last unless $$b =~ /\G(.)/sgc;
+			$c = $1;
+			next if $c eq "\n" && length($token); # line splice
+			goto RESTART if $c eq "\n"; # line splice
+			$token .= '\\' . $c;
+			next;
+		}
+		die("internal error scanning character '$c'\n");
+	}
+	return length($token) ? $token : undef;
+}
+
 package ScriptParser;
 
 sub new {
-- 
gitgitgadget

