Processing failed with error 500. MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.

Hi,

Thanks for providing such a nice project. I have an issue similar to #472. Here is the screenshot first.

I built a local GROBID server and then used the Python client. The service I call is processFulltextDocument.
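
For reference, a request like the one in the access log below could be built roughly as follows. This is a minimal sketch using only the Python standard library, not the official GROBID Python client. The address `http://localhost:8070` assumes a default local installation, and the helper names `build_fulltext_request` and `send` are hypothetical; the PDF is sent in the multipart form field `input`.

```python
import urllib.error
import urllib.request
import uuid

# Assumed address of a locally started GROBID service; adjust host and port as needed.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def build_fulltext_request(pdf_bytes: bytes, filename: str,
                           url: str = GROBID_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart POST carrying the PDF in the 'input' field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=head + pdf_bytes + tail,
                                  headers=headers, method="POST")


def send(request: urllib.request.Request, timeout: float = 120.0):
    """Send the request and return (status, body); HTTP errors such as 500 are returned, not raised."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()
```

With a running server, `send(build_fulltext_request(open("paper.pdf", "rb").read(), "paper.pdf"))` would return the status code together with either the TEI result or the error body.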

The error messages are as follows:

127.0.0.1 - - [30/Jul/2020:02:14:59 +0000] "POST /api/processFulltextDocument HTTP/1.1" 500 89 "-" "python-requests/2.23.0" 378
ERROR [2020-07-30 02:14:59,633] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
! at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
! at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
! at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanAttribute(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
! ... 87 common frames omitted
! Causing: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
! at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
! at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
! at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
! at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
! at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
! at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
! at java.xml/javax.xml.parsers.SAXParser.parse(SAXParser.java:197)
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:376)
! ... 77 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [PARSING_ERROR] Cannot parse file: /Ship03/Sources/grobid/grobid-home/tmp/E38zA47w7D.lxml
! at org.grobid.core.document.Document.addTokenizedDocument(Document.java:388)
! at org.grobid.core.engines.Segmentation.processing(Segmentation.java:94)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:134)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:113)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:489)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:480)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:179)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:234)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:189)
! at jdk.internal.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:566)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
! at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
! at io.dropwizard.jetty.NonblockingServletHolder.handle(NonblockingServletHolder.java:49)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:35)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:45)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:39)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
! at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:239)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703)
! at io.dropwizard.jetty.BiDiGzipHandler.handle(BiDiGzipHandler.java:67)
! at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:174)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
! at org.eclipse.jetty.server.Server.handle(Server.java:505)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
! at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
! at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
! at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
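
The trace above shows the XML parser rejecting the intermediate `.lxml` file because it contains a byte sequence that is not valid UTF-8. One way to see where the bad bytes are is to decode the file strictly and report the first failure. The following is a minimal diagnostic sketch, assuming the temporary file named in the trace is still on disk; the function names `find_invalid_utf8` and `check_file` are hypothetical and not part of GROBID or pdfalto.

```python
from pathlib import Path


def find_invalid_utf8(data: bytes, context: int = 16):
    """Return (offset, surrounding bytes) for the first invalid UTF-8 sequence, or None if the data is valid."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first malformed sequence
    except UnicodeDecodeError as err:
        lo = max(0, err.start - context)
        hi = min(len(data), err.end + context)
        return err.start, data[lo:hi]
    return None


def check_file(path: str, context: int = 16):
    """Read a file as raw bytes and look for the first invalid UTF-8 sequence."""
    return find_invalid_utf8(Path(path).read_bytes(), context)
```

Calling `check_file` on the `.lxml` path from the `GrobidException` message would give the byte offset and a few neighbouring bytes, which may help when reporting the case to the pdfalto project.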

I also tried the version on your website and got the same error.

Here are my PDF files: Hu_Relation_Networks_for_CVPR_2018_paper.pdf and Shen_Neural_Style_Transfer_CVPR_2018_paper.pdf

Thank you for your help.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 5

Top GitHub Comments

lfoppiano commented, Jul 30, 2020 (1 reaction)

Thanks @xmlyqing00. I’m reopening this issue since we want to fix the problem in pdfalto 😉

lfoppiano commented, Jul 30, 2020 (1 reaction)

Dear @xmlyqing00, thanks for reporting this problem and providing us with some useful error cases.

As @kermitt2 explained in #472, this problem must be addressed in the underlying pdfalto project; see issue https://github.com/kermitt2/pdfalto/issues/68.

If you need the service to work with these PDFs, I suggest merging PR #475 into your local GROBID git repository. That should fix the issue for the time being.
