Lexer.getCharIndex() return value not behaving as expected

See original GitHub issue
  • I have reproduced my issue using the latest version of ANTLR
  • I have asked at stackoverflow
  • Responses from the above seem to indicate that my issue could be an ANTLR bug
  • I have done a search of the existing issues to make sure I’m not sending in a duplicate

Language: Java
ANTLR Version: 4.9.3

parser grammar TestParser;

options { tokenVocab=TestLexer; }

root
    : LINE+ EOF
    ;

lexer grammar TestLexer;

@lexer::members {
    private int startIndex = 0;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void printNumber() {
        String number = _input.getText(Interval.of(startIndex, getCharIndex() - 1));
        System.out.println(number);
    }
}

LINE:                          {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} DOT .+? NEWLINE;
OTHER:                         . -> skip;

fragment NUMBER:               [0-9]+;
fragment ANSWER:               '( ' [A-D] ' )';
fragment SPACE:                ' ';
fragment NEWLINE:              '\n';
fragment DOT:                  '.';

Execute the following code:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;

public class TestParseTest {

    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( B ) 12. hahaha\n"+
                "( B ) 123. hahaha\n");
        Lexer lexer = new TestLexer(charStream);

        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();

        System.out.println(parseTree.toStringTree(parser));
    }

}

The output is as follows:

12
12
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)

Expected output:

12
123
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
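
One way to sidestep the question of when mid-rule lexer actions run is to pull the number out of the matched text after the whole LINE token has been recognized, for example from the token text in a listener. The sketch below is plain Java with no ANTLR dependency. The class name LineNumberExtractor, the method extractNumber, and the regular expression are assumptions made for illustration; they are not part of the original issue or of the ANTLR API. The pattern simply mirrors the ANSWER SPACE NUMBER DOT prefix of the LINE rule.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR or the original issue.
final class LineNumberExtractor {

    // Mirrors the prefix of the LINE rule: ANSWER SPACE NUMBER DOT,
    // i.e. "( X ) " followed by digits and a dot. Group 1 is the NUMBER part.
    private static final Pattern LINE_PREFIX =
            Pattern.compile("^\\( [A-D] \\) ([0-9]+)\\.");

    private LineNumberExtractor() {
    }

    // Returns the digits of the line number, or empty if the text
    // does not start with the expected prefix.
    static Optional<String> extractNumber(String lineText) {
        Matcher matcher = LINE_PREFIX.matcher(lineText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Same two lines as the input used in the reproduction.
        String[] lines = {"( B ) 12. hahaha\n", "( B ) 123. hahaha\n"};
        for (String line : lines) {
            System.out.println(extractNumber(line).orElse("<no match>"));
        }
    }
}
```

In a listener or visitor, the argument would be the text of the LINE token (for example via getText() on the token); that wiring is left out here. This does not explain or fix the getCharIndex() behavior itself, it only avoids depending on it.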

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 45 (45 by maintainers)

Top GitHub Comments

2 reactions
KvanTTT commented, Apr 7, 2022

I implemented a correct ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES warning in https://github.com/antlr/antlr4/pull/3626, checked the possible cases, and added tests.

which starts with “The parser will not evaluate predicates during prediction that occur after an action or token reference.”

It looks like we need a warning for this case as well. It’s also confusing; I’ve encountered the problem several times (for example, when detecting whether an identifier matches the get or set string for JavaScript: https://github.com/antlr/grammars-v4/blob/master/javascript/javascript/JavaScriptParser.g4#L437-L443).

1 reaction
kaby76 commented, Apr 5, 2022

Wonderful. For CSharp, I don’t get “11” but “01”.

lexer grammar LLexer;
A:
 {System.Console.WriteLine("first " + this.CharIndex);}
 'A'
 {System.Console.WriteLine("second " + this.CharIndex);}
 ;

parser grammar LParser;
options { tokenVocab = LLexer; }
file : A+ EOF;

But, yes, for Java, it’s “11”.

What a friggin’ mess.
