Question about blockEntityLinkage

I have a question about the following call:

val linkedResults = LuceneRDD.blockEntityLinkage(a, b,
  linker,
  Array("blocker"),
  Array("blocker"),
  500
)

How will it link a and b?

Will it (a) search for all the entries of a in b, (b) search for all the entries of b in a, or (c) combine a and b and search both?

Ideally, I’d like to achieve option (b): I have a small table “b” that I’d like to link against a larger table “a”.

Thank you.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

zouzias commented, Mar 11, 2019

You can do:

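val topK = 5
val blockingFields = Array("blocker")
// Pass the small table b first and the large table a second, so each row of b is searched in a.
val linkedResults = LuceneRDD.blockEntityLinkage(b, a, linker, blockingFields, blockingFields, topK)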

zouzias commented, Mar 9, 2019

The semantics of block linkage are as follows:

  1. Split a and b into blocks based on the value in the column “blocker”. Say one value of the blocker column is “blocker_value_1”.
  2. Then approximately link only those rows of a and b whose value in the column “blocker” is exactly “blocker_value_1”, and do the same for every other blocker value.

You can block on any column for which you know that a and b share values, provided it is safe to assume that true matches always agree on that column.
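
The blocking semantics described above can be sketched with plain Scala collections. This is only an illustration of the idea, not the spark-lucenerdd implementation: the `Record` case class, the token-overlap `similarity` function, and the `blockLink` helper are all hypothetical names made up for this example, and the real library works on DataFrames with Lucene queries instead. The snippet is written as a Scala script (top-level definitions), so it is meant for the REPL or a script runner.

```scala
// Illustrative sketch only: plain collections, no Spark or Lucene involved.
final case class Record(id: Int, name: String, blocker: String)

// Hypothetical similarity: number of lowercase tokens the two names share.
def similarity(x: String, y: String): Int = {
  val xs = x.toLowerCase.split("\\s+").toSet
  val ys = y.toLowerCase.split("\\s+").toSet
  xs.intersect(ys).size
}

// For every query record, keep the ids of the topK best-scoring entity
// records that have exactly the same blocker value as the query.
def blockLink(queries: Seq[Record], entities: Seq[Record], topK: Int): Map[Int, Seq[Int]] = {
  // Step 1: split the searched table into blocks keyed by the blocker value.
  val entityBlocks = entities.groupBy(_.blocker)
  queries.map { q =>
    // Step 2: compare the query only against records in its own block.
    val candidates = entityBlocks.getOrElse(q.blocker, Seq.empty)
    val best = candidates
      .map(e => (e.id, similarity(q.name, e.name)))
      .filter(_._2 > 0)
      .sortBy { case (id, score) => (-score, id) } // best score first, ties by id
      .take(topK)
      .map(_._1)
    q.id -> best
  }.toMap
}

// Large table a and small table b, mirroring the question.
val a = Seq(
  Record(1, "Acme Corp Berlin", "DE"),
  Record(2, "Acme Corporation Paris", "FR"),
  Record(3, "Globex Berlin", "DE"))
val b = Seq(Record(10, "acme berlin", "DE"))

// Rows of b are the queries; rows of a are what gets searched.
val linked = blockLink(queries = b, entities = a, topK = 5)
println(linked)
```

In this toy data, record 2 is never considered for the query from b because its blocker value differs, even though its name also contains “Acme”. Swapping the `queries` and `entities` arguments reverses the direction of the search, which is the same idea as swapping the first two arguments of blockEntityLinkage.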

See Slide 15 here: http://helios.mi.parisdescartes.fr/~themisp/publications/PapadakisPalpanas-TutorialWWW2018.pdf
