grobid 0.7.2 not recognizing greek alphabets
I’m running grobid-0.7.2 on Windows 11 using Docker. I followed the instructions from your documentation to run the Docker image lfoppiano/grobid:0.7.2.
While PDF extraction and sentence extraction work like a charm, grobid is not recognizing Greek letters in my PDFs. I have a lot of text in the PDF that includes Greek letters as a suffix to English words. For example, here’s a sentence from my PDF: "It has also been reported that the production of interferon-g (IFN-g) may be lowered." The g here is actually written as gamma in the PDF (I just didn’t know how to write Greek letters here). grobid is converting the Greek letter gamma (g) to the Unicode delete symbol U+2425.
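One way to see exactly which characters are affected is to scan the extracted text for symbols from the Unicode Control Pictures block (U+2400 to U+243F), which includes U+2425. The following is only a minimal sketch, assuming the text returned by grobid is already available as a Python string; `find_suspect_chars` is a hypothetical helper name, not part of grobid, and the sample string is made up from the sentence above.

```python
import unicodedata

# Control Pictures block (U+2400..U+243F). These symbols rarely belong in
# running text, so finding one suggests a character was mapped incorrectly.
CONTROL_PICTURES = range(0x2400, 0x2440)


def find_suspect_chars(text, context=15):
    """Return (index, codepoint, name, snippet) for each suspicious character.

    Hypothetical helper for illustration; it is not part of grobid.
    """
    hits = []
    for i, ch in enumerate(text):
        if ord(ch) in CONTROL_PICTURES:
            # Keep a little surrounding text so the hit is easy to locate.
            snippet = text[max(0, i - context): i + context + 1]
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name, snippet))
    return hits


if __name__ == "__main__":
    # Assumed sample: the sentence from the PDF with gamma replaced by U+2425.
    sample = "production of interferon-\u2425 (IFN-\u2425) may be lowered."
    for hit in find_suspect_chars(sample):
        print(hit)
```

Running the same check on text copied straight out of a PDF viewer may help tell whether the wrong character comes from the PDF itself or from the extraction step.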
I also changed the sentence segmenter to the pragmatic sentence detector in my local YAML file and mounted the file using the following command:
docker run --rm -p 8070:8070 -p 8070:8070 -v "D:/grobid/grobid-0.7.2/grobid-0.7.2/grobid-home/config/grobid.yaml":/opt/grobid/grobid-home/config/grobid.yaml:ro lfoppiano/grobid:0.7.2
grobid is still not parsing Greek letters correctly. I’m pretty sure I’m missing something. Can someone please help me?
Thanks in advance PD
Issue Analytics
- Created 10 months ago
- Comments:6 (3 by maintainers)
Top GitHub Comments
Thanks @kermitt2. I’m closing this issue.
P.S. Not only is grobid a great tool, but support is also great. I’m glad I stumbled upon it.
Thanks @kermitt2. I figured it was a PDF encoding issue. And thanks for your offer to check it out in detail. Here’s the link to the PDF:
https://www.academia.edu/download/73371279/4000650.pdf
Let me know if it does not work and I can upload the PDF to dropbox/google drive. Also, can you please let me know how you debug it?
Thanks again PD