Unable to create multiple lsh indices each one in its own keyspace


First of all, thank you for the great work, @ekzhu! Here is a reproducible test. My expectation is that each LSH index gets its own Cassandra keyspace. Unfortunately, the tables of all LSH indices are created in the same keyspace.

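from datasketch import MinHash, MinHashLSH

# Assumption: contact point(s) of your Cassandra cluster; adjust as needed.
cassandra_seeds = ['127.0.0.1']


def create_lsh_index(index):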
    print("(Create index: {}) - ".format(index), end='')
    threshold = 0.75
    num_perm = 128
    doc1 = MinHash(num_perm)
    lsh = MinHashLSH(
        threshold=threshold, num_perm=num_perm, storage_config={
            'type': 'cassandra',
            'basename': index.encode('ascii'),
            'cassandra': {
                'seeds': cassandra_seeds,
                'keyspace': index,
                'replication': {
                    'class': 'SimpleStrategy',
                    'replication_factor': '3',
                },
                'drop_keyspace': True,
                'drop_tables': True,
            }
        }
    )
    lsh.insert("a", doc1)
    lsh.insert("b", doc1)
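    counts = lsh.get_counts()
    # get_counts() returns one entry per band (hash table); the table listing
    # further down shows 11 bucket tables (0000-000a) per index.
    assert len(counts) == 11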


def test_cassandra_multi_index():
    create_lsh_index('idx1')
    create_lsh_index('idx2')
    create_lsh_index('idx3')

The resulting state in Cassandra is shown below. Only the idx1 keyspace exists, and it contains the tables of all three indices:

cqlsh> DESCRIBE keyspaces;

system_schema  system_traces  **idx1**      system_distributed_everywhere
system_auth    same_index     system  system_distributed 

cqlsh> use idx1;

cqlsh:idx1> desc tables;

lsh_idx3_keys         lsh_idx2_bucket_0003  lsh_idx3_bucket_0006
lsh_idx2_bucket_000a  lsh_idx2_bucket_0002  lsh_idx3_bucket_0007
lsh_idx1_keys         lsh_idx2_bucket_0009  lsh_idx1_bucket_0008
lsh_idx2_keys         lsh_idx2_bucket_0008  lsh_idx1_bucket_0009
lsh_idx1_bucket_000a  lsh_idx3_bucket_0008  lsh_idx1_bucket_0006
lsh_idx3_bucket_000a  lsh_idx3_bucket_0009  lsh_idx1_bucket_0007
lsh_idx2_bucket_0005  lsh_idx3_bucket_0000  lsh_idx1_bucket_0004
lsh_idx2_bucket_0004  lsh_idx3_bucket_0001  lsh_idx1_bucket_0005
lsh_idx2_bucket_0007  lsh_idx3_bucket_0002  lsh_idx1_bucket_0002
lsh_idx2_bucket_0006  lsh_idx3_bucket_0003  lsh_idx1_bucket_0003
lsh_idx2_bucket_0001  lsh_idx3_bucket_0004  lsh_idx1_bucket_0000
lsh_idx2_bucket_0000  lsh_idx3_bucket_0005  lsh_idx1_bucket_0001
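
To make the symptom easier to check, here is a minimal sketch that groups table names by index basename. It assumes the naming pattern visible in the listing above (`lsh_<basename>_keys` and `lsh_<basename>_bucket_<hex>`); the helper `group_by_basename` and the abbreviated `TABLES` sample are illustrative and not part of datasketch.

```python
import re
from collections import defaultdict

# A few table names copied from the `desc tables` listing (abbreviated sample).
TABLES = """
lsh_idx3_keys         lsh_idx2_bucket_0003  lsh_idx3_bucket_0006
lsh_idx1_keys         lsh_idx2_bucket_0009  lsh_idx1_bucket_0008
lsh_idx2_keys         lsh_idx2_bucket_0008  lsh_idx1_bucket_0009
""".split()

# Assumed pattern, inferred from the listing only.
TABLE_RE = re.compile(r"^lsh_(?P<basename>.+)_(?:keys|bucket_[0-9a-f]{4})$")


def group_by_basename(table_names):
    """Hypothetical helper: map each index basename to its table names."""
    groups = defaultdict(list)
    for name in table_names:
        match = TABLE_RE.match(name)
        if match:
            groups[match.group("basename")].append(name)
    return dict(groups)


groups = group_by_basename(TABLES)
for basename, names in sorted(groups.items()):
    print(basename, len(names))
```

If each index lived in its own keyspace, running this against the tables of a single keyspace would yield exactly one basename. Here it yields three, which confirms that idx1, idx2, and idx3 all share the idx1 keyspace.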

Looking forward to hearing from you about what we are doing wrong, since we don’t have any practical experience with datasketch yet.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

ronassa commented, Jan 16, 2022 (1 reaction)

Hi @ekzhu, I added a pull request that fixes this issue by creating or switching to a different keyspace when needed: https://github.com/ekzhu/datasketch/pull/172
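
The pull request itself is not reproduced here. As a rough illustration of what creating or switching to a keyspace means at the CQL level, the sketch below builds the two statements one might issue per index. The helper `keyspace_statements` is hypothetical and is not taken from datasketch or the pull request; the replication settings mirror the test above.

```python
def keyspace_statements(keyspace, replication):
    """Hypothetical helper: CQL to create a keyspace if missing, then switch to it."""
    options = ", ".join(f"'{key}': '{value}'" for key, value in replication.items())
    create = f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{options}}}"
    use = f"USE {keyspace}"
    return [create, use]


replication = {"class": "SimpleStrategy", "replication_factor": "3"}
for index in ("idx1", "idx2", "idx3"):
    for statement in keyspace_statements(index, replication):
        print(statement)
```

Issuing the pair once per index, rather than once per session, is what would keep the tables of idx2 and idx3 out of the idx1 keyspace.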

ekzhu commented, Feb 4, 2022 (0 reactions)

Merged and released.
