Lexer.getCharIndex() return value not behaving as expected
- I have reproduced my issue using the latest version of ANTLR
- I have asked at stackoverflow
- Responses from the above seem to indicate that my issue could be an ANTLR bug
- I have done a search of the existing issues to make sure I’m not sending in a duplicate
Language: Java
ANTLR Version: 4.9.3
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
: LINE+ EOF
;
lexer grammar TestLexer;
@lexer::members {
    private int startIndex = 0;
    private void updateStartIndex() {
        startIndex = getCharIndex();
    }
    private void printNumber() {
        String number = _input.getText(Interval.of(startIndex, getCharIndex() - 1));
        System.out.println(number);
    }
}
LINE: {getCharPositionInLine() == 0}? ANSWER SPACE {updateStartIndex();} NUMBER {printNumber();} DOT .+? NEWLINE;
OTHER: . -> skip;
fragment NUMBER: [0-9]+;
fragment ANSWER: '( ' [A-D] ' )';
fragment SPACE: ' ';
fragment NEWLINE: '\n';
fragment DOT: '.';
Execute the following code:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.ParseTree;
public class TestParseTest {
    public static void main(String[] args) {
        CharStream charStream = CharStreams.fromString("( B ) 12. hahaha\n" +
                "( B ) 123. hahaha\n");
        Lexer lexer = new TestLexer(charStream);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree parseTree = parser.root();
        System.out.println(parseTree.toStringTree(parser));
    }
}
The output is as follows:
12
12
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
Expected output:
12
123
(root ( B ) 12. hahaha\n ( B ) 123. hahaha\n <EOF>)
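A possible workaround, sketched here as an assumption rather than taken from the issue thread: instead of recording getCharIndex() in actions placed in the middle of the LINE rule, extract the number from the full text of the matched token. The class name LineNumberExtractor and the method extractNumber are hypothetical; the regular expression simply mirrors the ANSWER SPACE NUMBER DOT prefix of the LINE rule above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of ANTLR or of the original issue).
public class LineNumberExtractor {
    // Mirrors the prefix of the LINE rule: ANSWER SPACE NUMBER DOT,
    // i.e. "( " [A-D] " )" followed by a space, digits, and a dot.
    private static final Pattern LINE_PREFIX =
            Pattern.compile("^\\( [A-D] \\) ([0-9]+)\\.");

    // Returns the digits of the NUMBER part, or null if the text
    // does not start with the expected prefix.
    static String extractNumber(String lineText) {
        Matcher m = LINE_PREFIX.matcher(lineText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractNumber("( B ) 12. hahaha\n"));
        System.out.println(extractNumber("( B ) 123. hahaha\n"));
    }
}
```

In the grammar, such a helper could presumably be called from a single action at the very end of the LINE rule, for example {System.out.println(LineNumberExtractor.extractNumber(getText()));}, so that nothing depends on the character index seen by mid-rule actions. Whether this fits a real grammar depends on how the number is used afterwards.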
Issue Analytics
- State:
- Created a year ago
- Comments:45 (45 by maintainers)
Top GitHub Comments
I implemented a correct ACTION_SHOULD_BE_PLACED_AFTER_PREDICATES warning here: https://github.com/antlr/antlr4/pull/3626. I’ve checked the possible cases and added tests.

It looks like we need a warning for this case as well. It’s also confusing; I’ve encountered the problem several times (for example, when detecting whether an identifier matches the get or set string for JavaScript: https://github.com/antlr/grammars-v4/blob/master/javascript/javascript/JavaScriptParser.g4#L437-L443).

Wonderful. For CSharp, I don’t get “11” but “01”. But, yes, for Java, it’s “11”. What a friggin’ mess.