Lexer.getCharIndex() return value not behaving as expected

I have reproduced my issue using the latest version of ANTLR
I have asked at stackoverflow
Responses from the above seem to indicate that my issue could be an ANTLR bug
I have done a search of the existing issues to make sure I’m not sending in a duplicate

Language: Java
ANTLR Version: 4.9.3

parser grammar TestParser;

options { tokenVocab=TestLexer; }

root
    : LINE+ EOF
    ;

lexer grammar TestLexer;

@lexer::members {
    private int startIndex = 0;

    private void updateStartIndex() {
        startIndex = getCharIndex();
    }

    private void printNumber() {
        String number = _input.getText(Interval.of(startIndex, getCharIndex() - 1));
        System.out.println(number);
    }
}

LINE:                          {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} DOT .+? NEWLINE;
OTHER:                         . -> skip;

fragment NUMBER:               [0-9]+;
fragment ANSWER:               '( ' [A-D] ' )';
fragment SPACE:                ' ';
fragment NEWLINE:              '\n';
fragment DOT:                  '.';

Execute the following code:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;

public class TestParseTest {

    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( B ) 12. hahaha\n"+
                "( B ) 123. hahaha\n");
        Lexer lexer = new TestLexer(charStream);

        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();

        System.out.println(parseTree.toStringTree(parser));
    }

}

The output is as follows:

12
12
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)

Expected output:

12
123
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
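
One possible workaround, sketched here as an assumption rather than the fix adopted upstream: avoid reading getCharIndex() in mid-rule actions and instead recover the number from the full text of the matched LINE token. The class LineNumberExtractor and its method extractNumber are hypothetical names introduced for illustration; the regular expression simply mirrors the ANSWER SPACE NUMBER DOT prefix of the grammar above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of ANTLR): extracts the NUMBER fragment
// from the full text of a LINE token such as "( B ) 123. hahaha\n".
final class LineNumberExtractor {

    // Mirrors the grammar prefix: ANSWER SPACE NUMBER DOT, anchored at the start.
    private static final Pattern LINE_PREFIX =
            Pattern.compile("^\\( [A-D] \\) ([0-9]+)\\.");

    // Returns the digits of the number, or null if the text is not a LINE.
    static String extractNumber(String lineText) {
        Matcher m = LINE_PREFIX.matcher(lineText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractNumber("( B ) 12. hahaha\n"));
        System.out.println(extractNumber("( B ) 123. hahaha\n"));
    }
}
```

Assuming a single action placed at the very end of the LINE rule sees the whole match through getText(), it could call LineNumberExtractor.extractNumber(getText()), so the result would not depend on the character index at the time an embedded action runs.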

Top GitHub Comments

2 reactions

KvanTTT commented, Apr 7, 2022

I implemented the correct ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES warning here: https://github.com/antlr/antlr4/pull/3626 (I’ve checked the possible cases and added tests).

which starts with “The parser will not evaluate predicates during prediction that occur after an action or token reference.”

It looks like we need a warning for this case as well. It’s also confusing; I’ve encountered the problem several times (for example, when detecting whether an identifier matches the get or set string for JavaScript: https://github.com/antlr/grammars-v4/blob/master/javascript/javascript/JavaScriptParser.g4#L437-L443).

1 reaction

kaby76 commented, Apr 5, 2022

Wonderful. For CSharp, I don’t get “11” but “01”.

lexer grammar LLexer;
A:
 {System.Console.WriteLine("first " + this.CharIndex);}
 'A'
 {System.Console.WriteLine("second " + this.CharIndex);}
 ;

parser grammar LParser;
options { tokenVocab = LLexer; }
file : A+ EOF;

But, yes, for Java, it’s “11”.

What a friggin’ mess.
