Unable to load models on Windows 10
Apologies, as I cannot say if this is 100% a bug, but it is nonetheless unexpected behavior after following the documentation.
I have a simple example I am trying to run in Spark NLP using Scala.
As background, I am on a Windows machine and have installed Spark 3.1.1 prebuilt for Hadoop 2.7 (following these instructions: https://phoenixnap.com/kb/install-spark-on-windows-10). Basic Spark programs work as expected, which leads me to think the problem does not lie with Spark and Hadoop alone: SPARK_HOME and HADOOP_HOME are set, the correct winutils.exe is in the hadoop/bin folder, and so on.
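For what it's worth, a quick check like the following (a sketch; the winutils path reflects my setup) confirms those variables are actually visible to the JVM:

// Sanity check that the Spark/Hadoop environment variables reach the JVM
println(sys.env.get("SPARK_HOME"))
println(sys.env.get("HADOOP_HOME"))
// winutils.exe should be present under %HADOOP_HOME%\bin
println(new java.io.File(sys.env("HADOOP_HOME"), "bin/winutils.exe").exists)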
Description
I have the following simple Spark NLP application in Scala:
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
object SparkNLPExplore extends App {
  val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    .setCleanupMode("shrink")

  val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")
    .setLazyAnnotator(false)

  val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
    .setContextChars(Array("(", ")", "?", "!"))
    .setSplitChars(Array("-"))
    .setExceptions(Array("New York", "e-mail"))
    .setSplitPattern("'")
    .setMinLength(0)
    .setMaxLength(99999)
    .setCaseSensitiveExceptions(false)

  val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en")
    .setInputCols("sentence", "token")
    .setOutputCol("embeddings")

  println(embeddings)
}
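As an aside, the program never creates a SparkSession explicitly; Spark NLP appears to start one internally on the first pretrained() call. In case that matters, here is a sketch with an explicit session (SparkNLP.start() is the pattern from the Spark NLP quickstart; the annotators are unchanged):

import com.johnsnowlabs.nlp.SparkNLP

object SparkNLPExploreExplicit extends App {
  // Start the Spark session up front instead of relying on the implicit one
  val spark = SparkNLP.start()
  // ... same annotators and pretrained call as above ...
}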
Running the original program yields the following output:
[info] done compiling
[info] running SparkNLPExplore
bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[ WARN] Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
Download done! Loading the resource.
[error] (run-main-1) java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Ljava/lang/String;)Lorg/apache/hadoop/io/nativeio/NativeIO$POSIX$Stat;
[error] java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Ljava/lang/String;)Lorg/apache/hadoop/io/nativeio/NativeIO$POSIX$Stat;
[error] at org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Native Method)
[error] at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:460)
[error] at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfoByNativeIO(RawLocalFileSystem.java:821)
[error] at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:735)
[error] at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:703)
[error] at org.apache.hadoop.fs.LocatedFileStatus.<init>(LocatedFileStatus.java:52)
[error] at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2091)
[error] at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:2071)
[error] at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:280)
[error] at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
[error] at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
[error] at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
[error] at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
[error] at scala.Option.getOrElse(Option.scala:189)
[error] at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
[error] at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
[error] at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
[error] at scala.Option.getOrElse(Option.scala:189)
[error] at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
[error] at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1428)
[error] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[error] at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
[error] at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
[error] at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1463)
[error] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[error] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[error] at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
[error] at org.apache.spark.rdd.RDD.first(RDD.scala:1463)
[error] at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
[error] at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
[error] at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
[error] at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
[error] at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:363)
[error] at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:357)
[error] at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:27)
[error] at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:24)
[error] at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.com$johnsnowlabs$nlp$embeddings$ReadablePretrainedBertModel$$super$pretrained(BertEmbeddings.scala:290)
[error] at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained(BertEmbeddings.scala:246)
[error] at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained$(BertEmbeddings.scala:246)
[error] at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.pretrained(BertEmbeddings.scala:290)
[error] at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.pretrained(BertEmbeddings.scala:290)
[error] at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:30)
[error] at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:30)
[error] at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.com$johnsnowlabs$nlp$embeddings$ReadablePretrainedBertModel$$super$pretrained(BertEmbeddings.scala:290)
[error] at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained(BertEmbeddings.scala:245)
[error] at com.johnsnowlabs.nlp.embeddings.ReadablePretrainedBertModel.pretrained$(BertEmbeddings.scala:245)
[error] at com.johnsnowlabs.nlp.embeddings.BertEmbeddings$.pretrained(BertEmbeddings.scala:290)
[error] at SparkNLPExplore$.delayedEndpoint$SparkNLPExplore$1(SparkNLPExplore.scala:31)
[error] at SparkNLPExplore$delayedInit$body.apply(SparkNLPExplore.scala:8)
[error] at scala.Function0.apply$mcV$sp(Function0.scala:39)
[error] at scala.Function0.apply$mcV$sp$(Function0.scala:39)
[error] at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
[error] at scala.App.$anonfun$main$1$adapted(App.scala:80)
[error] at scala.collection.immutable.List.foreach(List.scala:392)
[error] at scala.App.main(App.scala:80)
[error] at scala.App.main$(App.scala:78)
[error] at SparkNLPExplore$.main(SparkNLPExplore.scala:8)
[error] at SparkNLPExplore.main(SparkNLPExplore.scala)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.lang.reflect.Method.invoke(Method.java:498)
[error] stack trace is suppressed; run 'last Compile / bgRun' for the full output
[ERROR] uncaught error in thread spark-listener-group-appStatus, stopping SparkContext
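An UnsatisfiedLinkError on NativeIO$POSIX.stat typically points at a mismatch between the Hadoop jars on the classpath and the native hadoop.dll/winutils.exe binaries; note that spark-core 3.1.1 from Maven pulls in Hadoop 3.2.x jars by default, regardless of which prebuilt Spark distribution is installed locally. A small diagnostic sketch to compare the two (both calls are standard Hadoop utilities):

// Which Hadoop version did sbt actually put on the classpath?
println(org.apache.hadoop.util.VersionInfo.getVersion)
// Did Hadoop locate and load the native library (hadoop.dll on Windows)?
println(org.apache.hadoop.util.NativeCodeLoader.isNativeCodeLoaded)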
For reference, here is my sbt file:
name := "spark-nlp"
version := "0.1"
scalaVersion := "2.12.10"

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.1",
  "org.apache.spark" %% "spark-mllib" % "3.1.1",
  "com.johnsnowlabs.nlp" %% "spark-nlp" % "3.0.1"
)
Expected Behavior
I would expect the embeddings to load and the object information to be printed.
Current Behavior
The pre-trained embeddings download successfully, but an exception is thrown when loading the resource. The behavior is consistent across every pre-trained model I have tried.
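Since the download completes and only the load fails, the loading step can be exercised directly against the already-downloaded files; by default Spark NLP caches models under ~/cache_pretrained, and the folder name below is a placeholder. This would confirm whether the failure is in reading the local files rather than in the download itself:

// Sketch: load a cached model directly, skipping the download step
// (replace <model_folder> with the actual directory under cache_pretrained)
val cached = BertEmbeddings.load("C:/Users/me/cache_pretrained/<model_folder>")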
Possible Solution
Perhaps this is something as silly as a version clash, though everything looks OK to me; a sketch of one way to test that follows.
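This assumes the clash would be between the Hadoop jars sbt resolves transitively and the installed 2.7 binaries; the pinned coordinates below are an experiment, not a confirmed fix:

// build.sbt addition: force the Hadoop client jars to match the installed 2.7 binaries
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.4"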
Things I have tried so far:
- I have additionally tried placing hadoop.dll in both %HADOOP_HOME%/bin and C:/Windows/System32, with no luck (a related java.library.path experiment is sketched after this list)
- I also updated the permissions of the winutils.exe file as suggested in #1022
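The related experiment mentioned above (a sketch, not something I have confirmed): rather than copying hadoop.dll around, fork the JVM from sbt and point java.library.path at the Hadoop native directory:

// build.sbt sketch: fork the run and expose hadoop.dll via java.library.path
fork := true
javaOptions += s"-Djava.library.path=${sys.env("HADOOP_HOME")}\\bin"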
Steps to Reproduce
- Run the simple Scala program above on Windows 10 with Spark 3.1.1 and Hadoop 2.7
Context
I am not able to use Spark NLP. Can someone please help? I have already spent many hours trying to resolve this issue.
Your Environment
- Spark NLP version (sparknlp.version()): 3.1.1
- Apache Spark version (spark.version): 2.7
- Java version (java -version): OpenJDK 64-Bit Server VM, Java 1.8.0_275
- Setup and installation (Pypi, Conda, Maven, etc.): sbt
- Operating System and version: Windows 10
- Link to your project (if any):
Hi @masonedmison, I'm working from your description to reproduce the issue. I'll get back to you as soon as I have inspected the environment and gotten the same outcome. Thank you for your patience.
Hello, I am closing the ticket as the issue is solved.