Update severely out-of-date Lucene dependency

The Lucene dependency is severely out-of-date and has published CVEs. I made a pass at updating Lucene in the parent POM, but warehouse/pom.xml overrides it back to 4.7.1. I can’t update it there, because code in warehouse/ingest-core still depends on tokenizer behavior from older versions that newer Lucene releases no longer support. I’m not familiar enough with the code to understand the consequences of dropping that behavior outright and switching to the newer, supported tokenizer behavior.

This is the diff I have so far:

diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java
index 8a8269e5..2d25c29b 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/DefaultTokenSearch.java
@@ -11,7 +11,7 @@ import java.util.List;
 import java.util.Set;
 import java.util.regex.Pattern;
 
-import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.analysis.CharArraySet;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java
index 02eed1fa..b68ea470 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardAnalyzer.java
@@ -8,8 +8,8 @@ import org.apache.lucene.analysis.core.LowerCaseFilter;
 import org.apache.lucene.analysis.core.StopAnalyzer;
 import org.apache.lucene.analysis.core.StopFilter;
 import org.apache.lucene.analysis.standard.StandardFilter;
-import org.apache.lucene.analysis.util.CharArraySet;
-import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.StopwordAnalyzerBase;
 import org.apache.lucene.util.Version;
 
 /**
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java
index 004a4bdb..5899775f 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/StandardTokenizer.java
@@ -24,6 +24,7 @@ import java.util.Map;
 
 import datawave.util.ObjectFactory;
 
+import org.apache.lucene.util.AttributeFactory;
 import org.apache.lucene.analysis.Tokenizer;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
 import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java
index c42e8b9e..3468a774 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenSearch.java
@@ -9,8 +9,8 @@ import java.util.List;
 
 import datawave.util.ObjectFactory;
 
-import org.apache.lucene.analysis.util.CharArraySet;
-import org.apache.lucene.analysis.util.WordlistLoader;
+import org.apache.lucene.analysis.CharArraySet;
+import org.apache.lucene.analysis.WordlistLoader;
 import org.apache.lucene.util.IOUtils;
 import org.apache.lucene.util.Version;
 import org.slf4j.Logger;
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java
index a79d242c..16b882ea 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/data/tokenize/TokenizationHelper.java
@@ -8,7 +8,7 @@ import datawave.ingest.data.config.DataTypeHelper;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.log4j.Logger;
 import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.analysis.CharArraySet;
 
 public class TokenizationHelper {
     
diff --git a/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java b/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java
index 6ee3863c..9ba5480c 100644
--- a/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java
+++ b/warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/tokenize/ExtendedContentIndexingColumnBasedHandler.java
@@ -54,7 +54,7 @@ import org.apache.hadoop.mapreduce.TaskAttemptContext;
 import org.apache.hadoop.mapreduce.TaskInputOutputContext;
 import org.apache.hadoop.util.bloom.BloomFilter;
 import org.apache.log4j.Logger;
-import org.apache.lucene.analysis.util.CharArraySet;
+import org.apache.lucene.analysis.CharArraySet;
 import org.infinispan.commons.util.Base64;
 
 import com.google.common.collect.Multimap;
diff --git a/warehouse/pom.xml b/warehouse/pom.xml
index 08228c48..56b20acc 100644
--- a/warehouse/pom.xml
+++ b/warehouse/pom.xml
@@ -45,7 +45,6 @@
         <version.basis-rlp>7.5.0</version.basis-rlp>
         <version.hadoop.processors>2.2.3</version.hadoop.processors>
         <version.hamcrest>1.3</version.hamcrest>
-        <version.lucene>4.7.1</version.lucene>
     </properties>
     <dependencyManagement>
         <dependencies>

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 12 (3 by maintainers)

Top GitHub Comments

drewfarris commented, Oct 23, 2018 (1 reaction)

@ctubbsii - it doesn’t strike me that it would be especially difficult to replicate the apostrophe functionality with our own TokenFilter implementation, but I will have to dig into it.
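
As a rough illustration of what such a filter might do, here is a minimal plain-Java sketch of string-level apostrophe handling. It assumes the behavior in question is stripping a trailing possessive 's from each token; the thread does not spell out the exact behavior, so treat that as a guess. The class and method names (ApostropheSketch, stripPossessive, filter) are hypothetical and not part of DATAWAVE or Lucene. In a real Lucene TokenFilter, the same logic would presumably sit inside incrementToken() and operate on the CharTermAttribute rather than on String values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not DATAWAVE or Lucene code.
public class ApostropheSketch {

    // Remove a trailing possessive ('s or ’s, any case of S) from a token.
    // Assumption: this is the "apostrophe functionality" being discussed.
    static String stripPossessive(String token) {
        int n = token.length();
        if (n >= 2) {
            char apostrophe = token.charAt(n - 2);
            char s = token.charAt(n - 1);
            boolean isApostrophe = apostrophe == '\'' || apostrophe == '\u2019';
            if (isApostrophe && (s == 's' || s == 'S')) {
                return token.substring(0, n - 2);
            }
        }
        return token;
    }

    // Apply the stripping to every token, dropping tokens that become empty.
    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String stripped = stripPossessive(token);
            if (!stripped.isEmpty()) {
                out.add(stripped);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [datawave, index, don't]
        System.out.println(filter(Arrays.asList("datawave's", "index", "don't")));
    }
}
```

Only the token's final two characters are inspected, so contractions such as "don't" pass through unchanged. Whether that matches what ingest-core actually relies on would need to be checked against the existing tokenizer tests.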

keith-ratcliffe commented, Oct 25, 2018 (0 reactions)

@drewfarris wrote:

Is there a PR up for this somewhere? I can’t seem to find it.

I haven’t seen one… just my diff above. @keith-ratcliffe's comments imply he has a branch testing the change. Perhaps he’ll push it to his fork and submit a PR? (nudge) 😺

Yep. See lucene-upgrade branch on this repo, and just created this PR
