Semantic Code Retrieval using Transformers
I am entering the world of transformers and would like to use some architectures to create a semantic search engine to retrieve source code (Python, JavaScript, Ruby, Go, Java, and PHP code).
Currently, the dataset contains 2 million (code, docstring) pairs, where code is the list of tokens of a method or function and docstring is a short natural-language description of that code.
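For illustration, a single pair could look like the following (a hypothetical example, not an entry from the actual dataset):
# hypothetical (code, docstring) pair; both sides are stored as token lists
example = [
    ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"],  # code tokens
    ["adds", "two", "numbers", "and", "returns", "the", "result"],          # docstring tokens
]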
As a starting point, it would be interesting to construct a model architecture that receives the code and the docstring ([ [code], [docstring] ]) as an input example and outputs the code embedding and the docstring embedding. Using cosine similarity as the loss function, the model could be fine-tuned to encode both code and docstring into the same embedding space, as shown in the figure below.
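One way to express that objective in code (a minimal sketch, not from the original issue; it assumes the two embeddings are already computed for a batch and that matching pairs share the same row index) is a simple pairwise cosine loss:
import tensorflow as tf

def cosine_similarity_loss(code_emb, doc_emb):
    """Minimizes 1 - cos(code_i, docstring_i), averaged over the batch (sketch)."""
    code_emb = tf.math.l2_normalize(code_emb, axis=-1)
    doc_emb = tf.math.l2_normalize(doc_emb, axis=-1)
    cos_sim = tf.reduce_sum(code_emb * doc_emb, axis=-1)  # shape: (batch,)
    return tf.reduce_mean(1.0 - cos_sim)
In practice a contrastive variant with in-batch negatives usually retrieves better, but the pairwise form above matches the description.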
I started reading and tokenizing the dataset:
from transformers import BertTokenizer

# reads a list of [[code], [docstring]] pairs (custom reader)
reader = CodeDocstringReader(dataset_path)

# loads the tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)

# returns a list of tokenized examples:
# [[code_token_ids], [docstring_token_ids]]
tokenized_features = tokenizer_examples(
    reader.get_examples(),
    tokenizer,
)
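Here tokenizer_examples is my own helper; roughly, it is meant to do something like this (a minimal sketch, not the exact implementation; max_length is just an illustrative value):
def tokenizer_examples(examples, tokenizer, max_length=256):
    """Tokenizes each [code, docstring] pair into two lists of token ids (sketch)."""
    features = []
    for code, docstring in examples:
        # the dataset stores pre-split tokens, so join them back into strings
        code_ids = tokenizer.encode(" ".join(code), max_length=max_length, truncation=True)
        docstring_ids = tokenizer.encode(" ".join(docstring), max_length=max_length, truncation=True)
        features.append([code_ids, docstring_ids])
    return features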
The model definition and training loop are still incomplete, but the outline is:
import tensorflow as tf
from transformers import TFBertModel

class JointEncoder(tf.keras.Model):
    """Encodes the code and the docstring into the same embedding space."""

    def __init__(self, path, name="jointencoder"):
        super(JointEncoder, self).__init__(name=name)
        # note: TFBertModel (not the PyTorch BertModel) is needed inside a tf.keras.Model
        self.bert = TFBertModel.from_pretrained(path)

    def call(self, inputs):
        """Returns code and docstring embeddings"""
        ...
        code_embedding = ...
        docstring_embedding = ...
        return code_embedding, docstring_embedding
However, I’m stuck on how to code this simple architecture. Could you give me some directions?
Thanks in advance.
Yes, it does run on Colab!
Maybe @LysandreJik or @sgugger have a link to a notebook?
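For anyone looking for a direction, here is a minimal sketch of how call could be completed (my own attempt, not a solution confirmed by the maintainers in this thread). It assumes the tokenized inputs arrive as two padded integer tensors, that the pad token id is 0 as in bert-base-uncased, and it reuses the cosine_similarity_loss sketched above:
import tensorflow as tf
from transformers import TFBertModel

class JointEncoder(tf.keras.Model):
    """Encodes code and docstring into the same embedding space (sketch)."""

    def __init__(self, path, name="jointencoder"):
        super(JointEncoder, self).__init__(name=name)
        # one shared BERT encoder for both inputs (Siamese setup)
        self.bert = TFBertModel.from_pretrained(path)

    def encode(self, token_ids):
        # assumes pad token id 0; mask out padding positions
        attention_mask = tf.cast(token_ids != 0, tf.int32)
        outputs = self.bert(token_ids, attention_mask=attention_mask)
        # take the hidden state of the [CLS] token as the sequence embedding
        return outputs[0][:, 0, :]

    def call(self, inputs):
        """Returns code and docstring embeddings"""
        code_ids, docstring_ids = inputs
        code_embedding = self.encode(code_ids)
        docstring_embedding = self.encode(docstring_ids)
        return code_embedding, docstring_embedding

# hypothetical training step wiring the model to the cosine loss above
model = JointEncoder("bert-base-uncased")
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)

@tf.function
def train_step(code_ids, docstring_ids):
    with tf.GradientTape() as tape:
        code_emb, doc_emb = model((code_ids, docstring_ids))
        loss = cosine_similarity_loss(code_emb, doc_emb)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
Mean pooling over the non-padding tokens is a common alternative to the [CLS] embedding, and an in-batch softmax over cosine similarities tends to work better for retrieval than the plain pairwise loss.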