Stuck at constructing embedding?

I currently have a dataset with more than 10 million rows and 384 dimensions. I use PCA to reduce the 384 dimensions to 10, and then apply UMAP via the BERTopic library.

To avoid running into memory issues, I am using a machine with 1TB of RAM and 128 cores. However, the process seems to hang at “Construct embedding”, and only about 500GB of RAM is in use (so it is not a memory issue).

Here are the code and the verbose output:


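import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# docs (the raw documents) and seed_topic_list are defined elsewhere
embeddings = np.load('embeddings.npy')

# Reduce the 384-dimensional embeddings to 10 dimensions before UMAP
pca = PCA(n_components=10)
embeddings_pca = pca.fit_transform(embeddings)

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=True, verbose=True)

# Setting HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True, seed_topic_list=seed_topic_list, low_memory=True, calculate_probabilities=True, vectorizer_model=vectorizer_model)

#topics, probs = topic_model.fit_transform(docs)

# Fit using the precomputed, PCA-reduced embeddings
topic_model = topic_model.fit(docs, embeddings_pca)
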
UMAP(angular_rp_forest=True, dens_frac=0.0, dens_lambda=0.0, metric='cosine',
     min_dist=0.0, n_components=5, verbose=True)
Construct fuzzy simplicial set
Tue Sep 28 11:33:15 2021 Finding Nearest Neighbors
Tue Sep 28 11:33:15 2021 Building RP forest with 64 trees
Tue Sep 28 11:34:42 2021 NN descent for 23 iterations
	 1  /  23
	 2  /  23
	Stopping threshold met -- exiting after 2 iterations
Tue Sep 28 11:49:29 2021 Finished Nearest Neighbor Search
Tue Sep 28 11:50:33 2021 Construct embedding

If I understand correctly, the most memory-consuming step should be the nearest neighbour search, which completed with no issue. Why does it get stuck at constructing the embedding?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 22 (1 by maintainers)

Top GitHub Comments

jlmelville commented, Sep 30, 2021

n_neighbors

> @jlmelville What is a reasonable number for n_neighbors when I have 10 million data points?

Really hard to say. I would start by seeing if n_neighbors=30 works and take it from there. Obviously with such a large dataset, doubling parameters isn’t something to do lightly, but parameters for experimenting with the spectral initialization directly aren’t exposed through the UMAP interface, so it’s difficult to do anything else.

> Or can I simply change the initialisation method to random here?

init="random" will work but it’s hard for UMAP (or any dimensionality reduction method that works in a similar way) to recover the global structure from a random start. If you have access to an efficient PCA package, then extracting the first two principal components (suitably scaled) and passing that as the init parameter would be a better starting point.
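The PCA-based initialization suggested above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the library's own method: the helper name scaled_pca_init and the target scale max_abs=10.0 are illustrative choices, and because the question uses n_components=5, five principal components are kept rather than two. UMAP's init parameter accepts an array of shape (n_samples, n_components).

```python
import numpy as np

def scaled_pca_init(X, n_components, max_abs=10.0):
    """Project X onto its leading principal components and rescale.

    Hypothetical helper: the name and the max_abs target are assumptions
    made for illustration, not part of UMAP or BERTopic.
    """
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0)
    # Eigendecomposition of the small (d x d) covariance matrix is cheap
    # even when the number of rows is very large.
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(evals)[::-1][:n_components]
    coords = Xc @ evecs[:, order]
    # Rescale so the largest absolute coordinate equals max_abs.
    coords *= max_abs / np.abs(coords).max()
    return coords.astype(np.float32)

# Small synthetic stand-in for the PCA-reduced embeddings.
rng = np.random.default_rng(0)
demo = rng.normal(size=(1000, 10))
init = scaled_pca_init(demo, n_components=5)

# The array would then be passed as, for example:
#   UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
#        metric='cosine', init=init, verbose=True)
```

Passing an explicit array skips the spectral initialization entirely, so this is only worth trying if that stage is really where the run stalls.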

It’s also possible that there is something in your dataset that is making the initialization take so long: are there lots of duplicates or close duplicates or all-zero rows? Bad behavior of the spectral initialization does seem to be related to the conditioning of the graph Laplacian matrix.
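A quick way to look for the rows mentioned above (all-zero rows, exact duplicates, close duplicates) might look like this. It is a rough sketch: the function name summarize_degenerate_rows and the rounding tolerance are assumptions, and rounding is only a crude proxy for "close" duplicates.

```python
import numpy as np

def summarize_degenerate_rows(X, decimals=4):
    """Count all-zero rows, exact duplicates, and rounded near-duplicates.

    Hypothetical helper; `decimals` is an assumed tolerance for what
    counts as a near-duplicate.
    """
    X = np.asarray(X)
    n_rows = X.shape[0]
    # A row is all-zero if none of its entries is non-zero.
    n_zero = int(np.sum(~X.any(axis=1)))
    # Rows beyond the first copy of each distinct row.
    n_exact_dup = n_rows - np.unique(X, axis=0).shape[0]
    # Same count after rounding, which also merges very close rows.
    n_near_dup = n_rows - np.unique(np.round(X, decimals), axis=0).shape[0]
    return {"zero_rows": n_zero,
            "exact_duplicates": n_exact_dup,
            "near_duplicates": n_near_dup}

toy = np.array([[0.0, 0.0],
                [0.0, 0.0],
                [1.0, 2.0],
                [1.0, 2.00001],
                [3.0, 4.0]])
report = summarize_degenerate_rows(toy)
```

On the real data this would be called on embeddings_pca. All-zero rows deserve particular attention with metric='cosine', since the cosine distance to a zero vector is not well defined.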

jlmelville commented, Sep 30, 2021

Your stack trace from the interrupt indicates that the problem is occurring at the spectral initialization stage. Where this has happened to me it seems to be when the graph is very nearly disconnected, but there are a few low-affinity edges that mean the disconnection detection routine still sees it as one connected graph.

If you are able to, try increasing n_neighbors.
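To illustrate what "very nearly disconnected" means, the toy sketch below counts connected components of a weighted graph before and after dropping weak edges. Everything here is an assumption made for illustration: the edge list, the weight threshold, and the helper names are hypothetical, and this is not how UMAP itself detects disconnection.

```python
def count_components(n_nodes, edges):
    """Count connected components with a simple union-find."""
    parent = list(range(n_nodes))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return len({find(i) for i in range(n_nodes)})

def components_after_pruning(n_nodes, weighted_edges, min_weight):
    """Drop edges lighter than min_weight, then count components."""
    kept = [(a, b) for a, b, w in weighted_edges if w >= min_weight]
    return count_components(n_nodes, kept)

# Two tight triangles joined by a single low-affinity edge (2, 3).
toy_edges = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.7),
             (3, 4, 0.9), (4, 5, 0.8), (3, 5, 0.7),
             (2, 3, 0.001)]
n_full = components_after_pruning(6, toy_edges, min_weight=0.0)
n_pruned = components_after_pruning(6, toy_edges, min_weight=0.01)
```

If the component count jumps once weak edges are removed, the graph is held together by only a few low-affinity links, which is the situation described above. For millions of points, scipy.sparse.csgraph.connected_components on a sparse matrix would be a far more practical choice than this pure-Python loop.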
