Cannot load text correctly by CsvParserPlugin if it includes `\0`.
See original GitHub issueHi maintainers,
We sometimes encounter a problem CSV Parser cannot parse string containing \0
.
This seems to be caused by CSV Parser judging that \0
is the end of a sentence.
How can we avoid this problem? or Could you solve the problem?
See the following code to reproduce the problem.
$ ruby -e '5.times {|i| puts "#{i}\taaaaa\0aaa#{i}a\tname#{i}"}' > data.tsv
$ cat data.tsv # the below text includes `\0`
0 aaaaaaaa0a name0
1 aaaaaaaa1a name1
2 aaaaaaaa2a name2
3 aaaaaaaa3a name3
4 aaaaaaaa4a name4
$ cat config.yml
in:
type: file
path_prefix: ./data.tsv
parser:
type: csv
delimiter: "\t"
skip_header_lines: 0
columns:
- { name: id, type: long }
- { name: description, type: string }
- { name: name, type: string }
stop_on_invalid_record: true
out:
type: stdout
$ embulk run config.yml
2017-05-16 13:24:04.405 +0900: Embulk v0.8.21
2017-05-16 13:24:06.004 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'data.tsv'
2017-05-16 13:24:06.006 +0900 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-05-16 13:24:06.010 +0900 [INFO] (0001:transaction): Loading files [data.tsv]
2017-05-16 13:24:06.055 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=16 / output tasks 8 = input tasks 1 * 8
2017-05-16 13:24:06.062 +0900 [INFO] (0001:transaction): {done: 0 / 1, running: 0}
2017-05-16 13:24:06.121 +0900 [INFO] (0001:transaction): {done: 1 / 1, running: 0}
org.embulk.exec.PartialExecutionException: org.embulk.spi.DataException: Invalid record at line 1: 0 aaaaaaaa0a name0
at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(org/embulk/exec/BulkLoader.java:373)
at org.embulk.exec.BulkLoader.doRun(org/embulk/exec/BulkLoader.java:591)
at org.embulk.exec.BulkLoader.access$000(org/embulk/exec/BulkLoader.java:33)
at org.embulk.exec.BulkLoader$1.run(org/embulk/exec/BulkLoader.java:389)
at org.embulk.exec.BulkLoader$1.run(org/embulk/exec/BulkLoader.java:385)
at org.embulk.spi.Exec.doWith(org/embulk/spi/Exec.java:25)
at org.embulk.exec.BulkLoader.run(org/embulk/exec/BulkLoader.java:385)
at org.embulk.EmbulkEmbed.run(org/embulk/EmbulkEmbed.java:180)
at java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)
at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:453)
at org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:314)
at RUBY.run(/Users/takahiro.nakayama/bin/embulk!/embulk/runner.rb:84)
at RUBY.run(/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_run.rb:269)
at RUBY.<main>(/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_main.rb:2)
at org.jruby.Ruby.runInterpreter(org/jruby/Ruby.java:850)
at org.jruby.Ruby.loadFile(org/jruby/Ruby.java:2976)
at org.jruby.RubyKernel.requireCommon(org/jruby/RubyKernel.java:963)
at org.jruby.RubyKernel.require(org/jruby/RubyKernel.java:956)
at org.jruby.RubyKernel$INVOKER$s$1$0$require19.call(org/jruby/RubyKernel$INVOKER$s$1$0$require19.gen)
at RUBY.(root)(uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1)
at Users.takahiro_dot_nakayama.bin.embulk.embulk.command.embulk_bundle.invokeOther66:require(Users/takahiro_dot_nakayama/bin/embulk/embulk/command/file:/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_bundle.rb:51)
at Users.takahiro_dot_nakayama.bin.embulk.embulk.command.embulk_bundle.<main>(file:/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_bundle.rb:51)
at java.lang.invoke.MethodHandle.invokeWithArguments(java/lang/invoke/MethodHandle.java:627)
at org.jruby.Ruby.runScript(org/jruby/Ruby.java:834)
at org.jruby.Ruby.runNormally(org/jruby/Ruby.java:749)
at org.jruby.Ruby.runNormally(org/jruby/Ruby.java:767)
at org.jruby.Ruby.runFromMain(org/jruby/Ruby.java:580)
at org.jruby.Main.doRunFromMain(org/jruby/Main.java:425)
at org.jruby.Main.internalRun(org/jruby/Main.java:313)
at org.jruby.Main.run(org/jruby/Main.java:242)
at org.jruby.Main.main(org/jruby/Main.java:204)
at org.embulk.cli.Main.main(org/embulk/cli/Main.java:23)
Caused by: org.embulk.spi.DataException: Invalid record at line 1: 0 aaaaaaaa0a name0
at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:368)
at org.embulk.spi.FileInputRunner.run(FileInputRunner.java:156)
at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.runInputTask(LocalExecutorPlugin.java:294)
at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.access$000(LocalExecutorPlugin.java:212)
at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:257)
at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:253)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.embulk.standards.CsvTokenizer$TooFewColumnsException: Too few columns
at org.embulk.standards.CsvTokenizer.nextColumn(CsvTokenizer.java:168)
at org.embulk.standards.CsvTokenizer.nextColumnOrNull(CsvTokenizer.java:379)
at org.embulk.standards.CsvParserPlugin$1.nextColumn(CsvParserPlugin.java:346)
at org.embulk.standards.CsvParserPlugin$1.stringColumn(CsvParserPlugin.java:302)
at org.embulk.spi.Column.visit(Column.java:58)
at org.embulk.spi.Schema.visitColumns(Schema.java:81)
at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:259)
... 9 more
Error: org.embulk.spi.DataException: Invalid record at line 1: 0 aaaaaaaa0a name0
Issue Analytics
- State:
- Created 6 years ago
- Comments:5 (5 by maintainers)
Top Results From Across the Web
javascript: Uncaught TypeError: Cannot read property 'objects ...
I am trying to use the jQuery CSV parser plugin. However I got this console message: Uncaught TypeError ...
Read more >Tackling the most common errors when trying to import a CSV
Everyone who has imported data has run into import errors - it doesn't matter if you are a first-timer, or a seasoned pro...
Read more >Configuration - Embulk
parser: If the input is file-based, parser plugin parses a file format (built-in ... Stop bulk load transaction if a file includes invalid...
Read more >Treasure Data FAQs - Comparably
You must be an account owner or administrator to change the role of an existing user. Open TD Console. Read More.
Read more >Slycat - Release 3.1.1
Additionally, columns cannot contain the same value for all runs. Slycat™ will reject such columns from being included in the CCA analysis, ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@civitaspo Thanks for reporting!
Hmm, I’m not sure why they have been
'\0'
, but I thought they can be just-1
,-2
or like that since Java’schar
is signed 2-bytes. What do you think? @mugaYea, or we may be able to extract the value as config option parameter and users configure any value. It is
'\0'
by default.