Stuck at constructing embedding?
I currently have a dataset with more than 10 million rows and 384 dimensions. I use PCA to reduce the 384 dimensions to 10, and then apply UMAP via the BERTopic library.
To avoid running into memory issues, I am using a machine with 1TB of RAM and 128 cores. However, the process seems to hang at “Construct embedding”, and only about 500GB of RAM is being used (so it is not a memory issue).
Here are the code and the verbose output:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
# docs and seed_topic_list are defined elsewhere (not shown in the issue)
embeddings = np.load('embeddings.npy')
# Reduce the 384-dimensional embeddings to 10 dimensions before UMAP
pca = PCA(n_components=10)
embeddings_pca = pca.fit_transform(embeddings)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory=True, verbose=True)
# Setting HDBSCAN model
hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True, seed_topic_list=seed_topic_list, low_memory=True, calculate_probabilities=True, vectorizer_model=vectorizer_model)
#topics, probs = topic_model.fit_transform(docs)
topic_model = topic_model.fit(docs, embeddings_pca)
UMAP(angular_rp_forest=True, dens_frac=0.0, dens_lambda=0.0, metric='cosine',
min_dist=0.0, n_components=5, verbose=True)
Construct fuzzy simplicial set
Tue Sep 28 11:33:15 2021 Finding Nearest Neighbors
Tue Sep 28 11:33:15 2021 Building RP forest with 64 trees
Tue Sep 28 11:34:42 2021 NN descent for 23 iterations
1 / 23
2 / 23
Stopping threshold met -- exiting after 2 iterations
Tue Sep 28 11:49:29 2021 Finished Nearest Neighbor Search
Tue Sep 28 11:50:33 2021 Construct embedding
If I understand correctly, the most memory-consuming step should be the nearest neighbour search, which completed with no issue. Why does it get stuck at constructing the embedding?
Issue Analytics
- State:
- Created 2 years ago
- Comments:22 (1 by maintainers)
Top GitHub Comments
Really hard to say. I would start by seeing if n_neighbors=30 works and take it from there. Obviously with such a large dataset, doubling parameters isn’t something to do lightly, but parameters for experimenting with the spectral initialization directly aren’t exposed through the UMAP interface, so it’s difficult to do anything else.
init="random" will work, but it’s hard for UMAP (or any dimensionality reduction method that works in a similar way) to recover the global structure from a random start. If you have access to an efficient PCA package, then extracting the first two principal components (suitably scaled) and passing that as the init parameter would be a better starting point.
It’s also possible that there is something in your dataset that is making the initialization take so long: are there lots of duplicates or close duplicates or all-zero rows? Bad behavior of the spectral initialization does seem to be related to the conditioning of the graph Laplacian matrix.
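For what it's worth, here is a minimal sketch of the two suggestions in this comment: building a PCA-based array for the init parameter, and counting all-zero and exactly duplicated rows. The helper names (pca_init, degenerate_row_report) and the rescaling to a maximum absolute coordinate of 10 are my own assumptions, not part of UMAP or BERTopic; the thread does not pin down what "suitably scaled" means. It uses only NumPy, and it takes as many components as the UMAP n_components (5 in the question) rather than two, on the assumption that the init array has to match the embedding dimension.

```python
import numpy as np


def pca_init(X, n_components=2, target_scale=10.0):
    """Return the leading principal-component scores of X, rescaled.

    target_scale is an assumed choice: the largest absolute coordinate
    of the result is set to this value.
    """
    X = np.asarray(X, dtype=np.float64)
    centered = X - X.mean(axis=0)
    # Thin SVD; cheap when X has few columns (e.g. the 10 PCA dimensions).
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    coords = U[:, :n_components] * S[:n_components]
    max_abs = np.abs(coords).max()
    if max_abs > 0:
        coords = coords * (target_scale / max_abs)
    return coords.astype(np.float32)


def degenerate_row_report(X):
    """Count all-zero rows and exact duplicate rows in X."""
    X = np.asarray(X)
    zero_rows = int(np.sum(~X.any(axis=1)))
    n_unique = np.unique(X, axis=0).shape[0]
    return {
        "n_rows": X.shape[0],
        "zero_rows": zero_rows,
        "exact_duplicates": X.shape[0] - n_unique,
    }


# Small synthetic stand-in for embeddings_pca (not the real data).
rng = np.random.default_rng(0)
demo = rng.normal(size=(1000, 10))
demo[:3] = 0.0        # three all-zero rows
demo[11] = demo[10]   # one exact duplicate pair

report = degenerate_row_report(demo)
init = pca_init(demo, n_components=5)
print(report)
print(init.shape)
```

The resulting array could then be passed as UMAP(..., n_components=5, init=init), since init also accepts a NumPy array of starting positions. Note that this only counts exact duplicates; close duplicates would need a different check, and all-zero rows are worth particular attention when the metric is cosine.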
Your stack trace from the interrupt indicates that the problem is occurring at the spectral initialization stage. Where this has happened to me it seems to be when the graph is very nearly disconnected, but there are a few low-affinity edges that mean the disconnection detection routine still sees it as one connected graph.
If you are able to, try increasing n_neighbors.
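To illustrate the failure mode described in this comment (a graph that is very nearly disconnected but still counts as one component because of a few weak edges), here is a toy sketch in plain Python. The function names and the min_weight threshold are hypothetical, and this is not UMAP's own disconnection check; the edge list and weights stand in for the weighted nearest-neighbour graph, which you would have to extract yourself.

```python
def count_components(n_nodes, edges):
    """Count connected components with a simple union-find."""
    parent = list(range(n_nodes))

    def find(i):
        # Walk up to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return len({find(i) for i in range(n_nodes)})


def components_after_pruning(n_nodes, edges, weights, min_weight):
    """Count components after dropping edges weaker than min_weight."""
    kept = [edge for edge, w in zip(edges, weights) if w >= min_weight]
    return count_components(n_nodes, kept)


# Toy graph: two tight triangles joined by a single low-affinity edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.01]

all_edges_components = components_after_pruning(6, edges, weights, min_weight=0.0)
pruned_components = components_after_pruning(6, edges, weights, min_weight=0.1)
print(all_edges_components, pruned_components)
```

If the component count jumps once the weakest edges are dropped, the graph is only held together by low-affinity links, which is the situation where a larger n_neighbors might help.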