question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Cannot load text correctly by CsvParserPlugin if it includes `\0`.

See original GitHub issue

Hi maintainers,

We sometimes encounter a problem CSV Parser cannot parse string containing \0 . This seems to be caused by CSV Parser judging that \0 is the end of a sentence.

https://github.com/embulk/embulk/blob/45aa8fc1520cb43f9285b161cb1d033c90cc33fd/embulk-standards/src/main/java/org/embulk/standards/CsvTokenizer.java#L24

How can we avoid this problem? or Could you solve the problem?

See the following code to reproduce the problem.

$ ruby -e '5.times {|i| puts "#{i}\taaaaa\0aaa#{i}a\tname#{i}"}' > data.tsv
$ cat data.tsv  # the below text includes `\0`
0   aaaaaaaa0a  name0
1   aaaaaaaa1a  name1
2   aaaaaaaa2a  name2
3   aaaaaaaa3a  name3
4   aaaaaaaa4a  name4

$ cat config.yml
in:
  type: file
  path_prefix: ./data.tsv
  parser:
    type: csv
    delimiter: "\t"
    skip_header_lines: 0
    columns:
      - { name: id, type: long }
      - { name: description, type: string }
      - { name: name, type: string }
    stop_on_invalid_record: true
out:
  type: stdout
  
$ embulk run config.yml
2017-05-16 13:24:04.405 +0900: Embulk v0.8.21
2017-05-16 13:24:06.004 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'data.tsv'
2017-05-16 13:24:06.006 +0900 [INFO] (0001:transaction): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2017-05-16 13:24:06.010 +0900 [INFO] (0001:transaction): Loading files [data.tsv]
2017-05-16 13:24:06.055 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=16 / output tasks 8 = input tasks 1 * 8
2017-05-16 13:24:06.062 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2017-05-16 13:24:06.121 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
org.embulk.exec.PartialExecutionException: org.embulk.spi.DataException: Invalid record at line 1: 0	aaaaaaaa0a	name0
	at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(org/embulk/exec/BulkLoader.java:373)
	at org.embulk.exec.BulkLoader.doRun(org/embulk/exec/BulkLoader.java:591)
	at org.embulk.exec.BulkLoader.access$000(org/embulk/exec/BulkLoader.java:33)
	at org.embulk.exec.BulkLoader$1.run(org/embulk/exec/BulkLoader.java:389)
	at org.embulk.exec.BulkLoader$1.run(org/embulk/exec/BulkLoader.java:385)
	at org.embulk.spi.Exec.doWith(org/embulk/spi/Exec.java:25)
	at org.embulk.exec.BulkLoader.run(org/embulk/exec/BulkLoader.java:385)
	at org.embulk.EmbulkEmbed.run(org/embulk/EmbulkEmbed.java:180)
	at java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)
	at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:453)
	at org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:314)
	at RUBY.run(/Users/takahiro.nakayama/bin/embulk!/embulk/runner.rb:84)
	at RUBY.run(/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_run.rb:269)
	at RUBY.<main>(/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_main.rb:2)
	at org.jruby.Ruby.runInterpreter(org/jruby/Ruby.java:850)
	at org.jruby.Ruby.loadFile(org/jruby/Ruby.java:2976)
	at org.jruby.RubyKernel.requireCommon(org/jruby/RubyKernel.java:963)
	at org.jruby.RubyKernel.require(org/jruby/RubyKernel.java:956)
	at org.jruby.RubyKernel$INVOKER$s$1$0$require19.call(org/jruby/RubyKernel$INVOKER$s$1$0$require19.gen)
	at RUBY.(root)(uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:1)
	at Users.takahiro_dot_nakayama.bin.embulk.embulk.command.embulk_bundle.invokeOther66:require(Users/takahiro_dot_nakayama/bin/embulk/embulk/command/file:/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_bundle.rb:51)
	at Users.takahiro_dot_nakayama.bin.embulk.embulk.command.embulk_bundle.<main>(file:/Users/takahiro.nakayama/bin/embulk!/embulk/command/embulk_bundle.rb:51)
	at java.lang.invoke.MethodHandle.invokeWithArguments(java/lang/invoke/MethodHandle.java:627)
	at org.jruby.Ruby.runScript(org/jruby/Ruby.java:834)
	at org.jruby.Ruby.runNormally(org/jruby/Ruby.java:749)
	at org.jruby.Ruby.runNormally(org/jruby/Ruby.java:767)
	at org.jruby.Ruby.runFromMain(org/jruby/Ruby.java:580)
	at org.jruby.Main.doRunFromMain(org/jruby/Main.java:425)
	at org.jruby.Main.internalRun(org/jruby/Main.java:313)
	at org.jruby.Main.run(org/jruby/Main.java:242)
	at org.jruby.Main.main(org/jruby/Main.java:204)
	at org.embulk.cli.Main.main(org/embulk/cli/Main.java:23)
Caused by: org.embulk.spi.DataException: Invalid record at line 1: 0	aaaaaaaa0a	name0
	at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:368)
	at org.embulk.spi.FileInputRunner.run(FileInputRunner.java:156)
	at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.runInputTask(LocalExecutorPlugin.java:294)
	at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor.access$000(LocalExecutorPlugin.java:212)
	at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:257)
	at org.embulk.exec.LocalExecutorPlugin$ScatterExecutor$1.call(LocalExecutorPlugin.java:253)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.embulk.standards.CsvTokenizer$TooFewColumnsException: Too few columns
	at org.embulk.standards.CsvTokenizer.nextColumn(CsvTokenizer.java:168)
	at org.embulk.standards.CsvTokenizer.nextColumnOrNull(CsvTokenizer.java:379)
	at org.embulk.standards.CsvParserPlugin$1.nextColumn(CsvParserPlugin.java:346)
	at org.embulk.standards.CsvParserPlugin$1.stringColumn(CsvParserPlugin.java:302)
	at org.embulk.spi.Column.visit(Column.java:58)
	at org.embulk.spi.Schema.visitColumns(Schema.java:81)
	at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:259)
	... 9 more

Error: org.embulk.spi.DataException: Invalid record at line 1: 0	aaaaaaaa0a	name0

Issue Analytics

  • State:open
  • Created 6 years ago
  • Comments:5 (5 by maintainers)

github_iconTop GitHub Comments

1reaction
dmikurubecommented, May 16, 2017

@civitaspo Thanks for reporting!

Hmm, I’m not sure why they have been '\0', but I thought they can be just -1, -2 or like that since Java’s char is signed 2-bytes. What do you think? @muga

0reactions
mugacommented, May 17, 2017

Yea, or we may be able to extract the value as config option parameter and users configure any value. It is '\0' by default.

Read more comments on GitHub >

github_iconTop Results From Across the Web

javascript: Uncaught TypeError: Cannot read property 'objects ...
I am trying to use the jQuery CSV parser plugin. However I got this console message: Uncaught TypeError ...
Read more >
Tackling the most common errors when trying to import a CSV
Everyone who has imported data has run into import errors - it doesn't matter if you are a first-timer, or a seasoned pro...
Read more >
Configuration - Embulk
parser: If the input is file-based, parser plugin parses a file format (built-in ... Stop bulk load transaction if a file includes invalid...
Read more >
Treasure Data FAQs - Comparably
You must be an account owner or administrator to change the role of an existing user. Open TD Console. Read More.
Read more >
Slycat - Release 3.1.1
Additionally, columns cannot contain the same value for all runs. Slycat™ will reject such columns from being included in the CCA analysis, ...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found