Update severely out-of-date Lucene dependency
The Lucene dependency is severely out-of-date, and has published CVEs. I made a pass at updating Lucene in the parent POM, but warehouse/pom.xml overrides it back to 4.7.1. I can’t update it there, because code in warehouse/ingest-core still depends on legacy tokenizer behavior that newer versions of Lucene no longer support. I’m not familiar enough with the code to understand the consequences of dropping that behavior outright and using the newer, supported tokenizer behavior instead.
This is the diff I have so far:
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java
index 8a8269e5..2d25c29b 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java
@@ -11,7 +11,7 @@ import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
-import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.analysis.CharArraySet;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java
index 02eed1fa..b68ea470 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java
@@ -8,8 +8,8 @@ import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
-import org.apache.lucene.analysis.util.CharArraySet;
-import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
/**
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java
index 004a4bdb..5899775f 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java
@@ -24,6 +24,7 @@ import java.util.Map;
import datawave.util.ObjectFactory;
+import org.apache.lucene.util.AttributeFactory;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java
index c42e8b9e..3468a774 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java
@@ -9,8 +9,8 @@ import java.util.List;
import datawave.util.ObjectFactory;
-import org.apache.lucene.analysis.util.CharArraySet;
-import org.apache.lucene.analysis.util.WordlistLoader;
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.Version;
import org.slf4j.Logger;
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java
index a79d242c..16b882ea 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java
@@ -8,7 +8,7 @@ import datawave.ingest.data.config.DataTypeHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.analysis.CharArraySet;
public class TokenizationHelper {
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java b/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java
index 6ee3863c..9ba5480c 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java
@@ -54,7 +54,7 @@ import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.log4j.Logger;
-import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.analysis.CharArraySet;
import org.infinispan.commons.util.Base64;
import com.google.common.collect.Multimap;
diff --git a/warehouse/pom.xml b/warehouse/pom.xml
index 08228c48..56b20acc 100644
--- a/warehouse/pom.xml
+++ b/warehouse/pom.xml
@@ -45,7 +45,6 @@
<version.basis-rlp>7.5.0</version.basis-rlp>
<version.hadoop.processors>2.2.3</version.hadoop.processors>
<version.hamcrest>1.3</version.hamcrest>
- <version.lucene>4.7.1</version.lucene>
</properties>
<dependencyManagement>
<dependencies>
Issue Analytics
- State:
- Created 5 years ago
- Comments:12 (3 by maintainers)
@ctubbsii - it doesn’t strike me that it would be especially difficult to replicate the apostrophe functionality with our own TokenFilter implementation, but I will have to dig into it.
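For discussion, here is a minimal sketch of the core string logic such a filter might need. It assumes "apostrophe functionality" means stripping a trailing possessive `'s` / `'S` from a term, similar to what Lucene's `ClassicFilter` does for apostrophe-typed tokens; that reading is an assumption, not something confirmed in this thread. `ApostropheSketch` and `stripPossessive` are hypothetical names, not part of datawave or Lucene. The sketch is plain Java so it runs without Lucene on the classpath; in a real `TokenFilter` the same check would presumably live in `incrementToken()` and shorten the term attribute in place.

```java
// Hypothetical helper for illustration only; not part of datawave or Lucene.
final class ApostropheSketch {

    private ApostropheSketch() {
        // static utility, no instances
    }

    /**
     * Returns the term with one trailing possessive suffix ('s or 'S) removed.
     * Terms without that suffix are returned unchanged. Only the ASCII
     * apostrophe is handled; other apostrophe characters pass through as-is.
     */
    static String stripPossessive(String term) {
        int n = term.length();
        if (n >= 2) {
            char quote = term.charAt(n - 2);
            char last = term.charAt(n - 1);
            if (quote == '\'' && (last == 's' || last == 'S')) {
                // Drop the apostrophe and the trailing s.
                return term.substring(0, n - 2);
            }
        }
        return term;
    }
}
```

With this, `stripPossessive("dog's")` yields `dog`, while a contraction such as `don't` is left alone because it does not end in `'s`. Whether ingest-core needs exactly this behavior (or also acronym-dot removal, for example) would still have to be checked against the existing tokenizer tests.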
Yep. See the lucene-upgrade branch on this repo; I also just created this PR