DeepSpeed and nn.Embedding issue
Hi Lucidrains, first of all thanks for the contribution. You are doing an awesome job here.
I’m trying to implement a Seq2Seq model with DeepSpeed, since I will have a 32k seq_len as input. This is my code:

```python
# (imports reconstructed for readability; exact paths may differ in my script.
#  Helpers such as indexesFromSentence, RangerLars, amp, the *_lang objects and
#  the hyper-parameters are defined elsewhere in my project.)
import random
import torch
import deepspeed
from torch.utils.data import Dataset
from reformer_pytorch import ReformerLM
from reformer_pytorch.generative_tools import TrainingWrapper


class GenomeToMolDataset(Dataset):
    def __init__(self, data, src_lang, trg_lang):
        super().__init__()
        self.data = data
        self.src_lang = src_lang
        self.trg_lang = trg_lang

    def __getitem__(self, index):
        pair = self.data[index]
        src = torch.tensor(indexesFromSentence(self.src_lang, pair[0]))
        trg = torch.tensor(indexesFromSentence(self.trg_lang, pair[1]))
        print('src:', src)
        print('trg:', trg)
        return src, trg

    def __len__(self):
        return len(self.data)


train_dataset = GenomeToMolDataset(tr_pairs, input_lang, target_lang)
test_dataset = GenomeToMolDataset(ts_pairs, input_lang, target_lang)

encoder = ReformerLM(
    num_tokens = input_lang.n_words,
    emb_dim = emb_dim,              # 128
    dim = dim,                      # 512
    bucket_size = bucket_size,      # 16
    depth = depth,                  # 6
    heads = heads,                  # 8
    n_hashes = n_hashes,
    max_seq_len = VIR_SEQ_LEN,
    ff_chunks = ff_chunks,          # 400, number of chunks for feedforward layer, make higher if there are memory issues
    attn_chunks = attn_chunks,      # 16, process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens
    #weight_tie = True,
    fixed_position_emb = True,
    return_embeddings = True        # return output of last attention layer
).cuda()

decoder = ReformerLM(
    num_tokens = target_lang.n_words,
    emb_dim = emb_dim,              # 128
    dim = dim,                      # 512
    bucket_size = bucket_size,      # 16
    depth = depth,                  # 6
    heads = heads,                  # 8
    n_hashes = n_hashes,
    ff_chunks = ff_chunks,          # 400
    attn_chunks = attn_chunks,      # 16
    max_seq_len = MOL_SEQ_LEN,
    fixed_position_emb = True,
    causal = True
).cuda()

encoder_optimizer = RangerLars(encoder.parameters())  # torch.optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = RangerLars(decoder.parameters())  # torch.optim.Adam(decoder.parameters(), lr=learning_rate)

if use_apex:
    encoder, encoder_optimizer = amp.initialize(encoder, encoder_optimizer, opt_level='O1')
    decoder, decoder_optimizer = amp.initialize(decoder, decoder_optimizer, opt_level='O1')

encoder = TrainingWrapper(encoder).cuda()
decoder = TrainingWrapper(decoder).cuda()

encoder_params = filter(lambda p: p.requires_grad, encoder.parameters())
decoder_params = filter(lambda p: p.requires_grad, decoder.parameters())

encoder_engine, encoder_optimizer, trainloader, _ = deepspeed.initialize(
    args=cmd_args, model=encoder, optimizer=encoder_optimizer,
    model_parameters=encoder_params, training_data=train_dataset, dist_init_required=True)
decoder_engine, decoder_optimizer, _, _ = deepspeed.initialize(
    args=cmd_args, model=decoder, optimizer=decoder_optimizer,
    model_parameters=decoder_params, dist_init_required=False)

# training
VALIDATE_EVERY = 1
SAVE_EVERY = 10
SAVE_DIR = './saved_model/'

_, encoder_client_sd = encoder_engine.load_checkpoint(SAVE_DIR + 'encoder/', None)
_, decoder_client_sd = decoder_engine.load_checkpoint(SAVE_DIR + 'decoder/', None)  # args.ckpt_id

for i, pair in enumerate(trainloader):
    src = pair[0]
    trg = pair[1]

    encoder_engine.train()
    decoder_engine.train()

    src = src.to(encoder_engine.local_rank)
    trg = trg.to(decoder_engine.local_rank)

    print(src.shape)
    print(src.dtype)
    print(trg.shape)
    print(trg.dtype)

    enc_keys = encoder_engine(src)
    loss = decoder_engine(trg, keys = enc_keys, return_loss = True)  # (1, 4096, 20000)
    encoder_engine.backward(loss)
    decoder_engine.backward(loss)
    encoder_engine.step()
    decoder_engine.step()
    print('Training Loss:', loss.item())

    if i % VALIDATE_EVERY == 0:
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            ts_src, ts_trg = random.choice(test_dataset)
            enc_keys = encoder(ts_src.to(device))
            loss = decoder(ts_trg.to(device), keys=enc_keys, return_loss=True)
            print(f'\tValidation Loss: {loss.item()}')

    if i % SAVE_EVERY == 0:
        encoder_client_sd['step'] = i
        decoder_client_sd['step'] = i
        ckpt_id = loss.item()
        encoder_engine.save_checkpoint(SAVE_DIR + 'encoder/', ckpt_id, client_sd=encoder_client_sd)
        decoder_engine.save_checkpoint(SAVE_DIR + 'decoder/', ckpt_id, client_sd=decoder_client_sd)
```
The issue I’m having is with the nn.Embedding layer: it wants Long integers as input, but DeepSpeed works only with floats, so it prompts this error:

```
RuntimeError: expected device cuda:0 and dtype Float but got device cuda:0 and dtype Long
```

If I cast the inputs to float, then the Embedding layer prompts the opposite error (Float where Long is expected).
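Purely for illustration (this is not my training script), here is my understanding of the constraint, reduced to a minimal, DeepSpeed-free snippet:

```python
import torch
import torch.nn as nn

# nn.Embedding performs an integer table lookup, so it only accepts
# Long/Int index tensors and rejects float inputs.
emb = nn.Embedding(num_embeddings=100, embedding_dim=8)
idx = torch.randint(0, 100, (1, 16))   # dtype torch.int64 (Long)

print(emb(idx).shape)                  # works: torch.Size([1, 16, 8])

try:
    emb(idx.float())                   # same indices, cast to float32
except RuntimeError as err:
    print('float input fails:', err)
```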
How can I use your ReformerLM as an encoder-decoder with DeepSpeed in this case? Is there any way I can work around the Embedding issue?
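For example, would a thin wrapper along these lines be a reasonable direction? (Untested sketch; `LongCastWrapper` is a hypothetical helper of mine, not part of reformer-pytorch or DeepSpeed.)

```python
import torch.nn as nn

class LongCastWrapper(nn.Module):
    """Cast inputs back to long right before the wrapped model, so that an
    upstream float cast does not break the embedding lookup."""
    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, x, **kwargs):
        return self.net(x.long(), **kwargs)

# e.g. encoder = LongCastWrapper(TrainingWrapper(encoder)).cuda()
```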
Thank you, Cal
Top GitHub Comments
@lucidrains this is the repo for the virus project: https://github.com/CalogeroZarbo/bioshield
I checked the new version of the library with the positional embedding and it works like a charm. Thank you for the fix!
@CalogeroZarbo Thank you for the trace! I believe you caught a bug with my sinusoidal positional encoding implementation, and it has been fixed in the latest version (I hope, please let me know).
That doesn’t sound silly at all, and I think we are largely on the same page. Research is trickling in that attention may work well for chemicals and molecules. There’s a lot left to explore. https://arxiv.org/abs/2002.08264 and https://twitter.com/EricTopol/status/1229150936028733440?s=19
Please share the database if you can! I would love to get involved. I played around with SMILES myself and have a generative model for chemicals up at https://thischemicaldoesnotexist.com using Reformer.
Finally, as a fellow practitioner, I’ve been thinking about how deep learning can be applied to this crisis. Evidence shows that deep learning can greatly speed up simulations (https://arxiv.org/abs/2001.08055), and I was wondering whether it would be fruitful to train a differentiable docking function, perhaps specific to the Spike protein of Covid. Such a module could eventually be used in some end-to-end pipeline for evaluating candidates. Anyway, I am very much an amateur in this arena, but those are my thoughts.