Semantic Code Retrieval using Transformers
I am entering the world of transformers and would like to use some architectures to create a semantic search engine to retrieve source code (Python, JavaScript, Ruby, Go, Java, and PHP code).
Currently, the dataset contains 2 million (code, docstring) pairs, where code is the list of tokens of a method or function and docstring is a short natural-language description of that code.
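For illustration, a single pair could look like the following (a hypothetical example, not an entry from the actual dataset):
# hypothetical (code, docstring) pair; both sides are stored as token lists
example = [
    ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"],  # code tokens
    ["adds", "two", "numbers", "and", "returns", "the", "result"],          # docstring tokens
]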
As a starting point, it would be interesting to construct a model architecture that receives the code and the docstring ([ [code], [docstring] ]) as an input example and outputs the code embedding and the docstring embedding. Using cosine similarity as the loss function, the model could be fine-tuned to encode both code and docstring into the same embedding space, as shown in the figure below.
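One way to express that objective in code (a minimal sketch, not from the original issue; it assumes the two embeddings are already computed for a batch and that matching pairs share the same row index) is a simple pairwise cosine loss:
import tensorflow as tf

def cosine_similarity_loss(code_emb, doc_emb):
    """Minimizes 1 - cos(code_i, docstring_i), averaged over the batch (sketch)."""
    code_emb = tf.math.l2_normalize(code_emb, axis=-1)
    doc_emb = tf.math.l2_normalize(doc_emb, axis=-1)
    cos_sim = tf.reduce_sum(code_emb * doc_emb, axis=-1)  # shape: (batch,)
    return tf.reduce_mean(1.0 - cos_sim)
In practice a contrastive variant with in-batch negatives usually retrieves better, but the pairwise form above matches the description.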
I started reading and tokenizing the dataset:
from transformers import BertTokenizer

# reads a list of [[code], [docstring]] pairs (custom reader)
reader = CodeDocstringReader(dataset_path)

# loads the tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)

# returns a list of tokenized examples:
# [[code_token_ids], [docstring_token_ids]]
tokenized_features = tokenizer_examples(
    reader.get_examples(),
    tokenizer,
)
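Here tokenizer_examples is my own helper; roughly, it is meant to do something like this (a minimal sketch, not the exact implementation; max_length is just an illustrative value):
def tokenizer_examples(examples, tokenizer, max_length=256):
    """Tokenizes each [code, docstring] pair into two lists of token ids (sketch)."""
    features = []
    for code, docstring in examples:
        # the dataset stores pre-split tokens, so join them back into strings
        code_ids = tokenizer.encode(" ".join(code), max_length=max_length, truncation=True)
        docstring_ids = tokenizer.encode(" ".join(docstring), max_length=max_length, truncation=True)
        features.append([code_ids, docstring_ids])
    return features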
The model definition and training loop are still incomplete, but the outline is:
import tensorflow as tf
from transformers import TFBertModel

class JointEncoder(tf.keras.Model):
    """Encodes the code and the docstring into the same embedding space."""

    def __init__(self, path, name="jointencoder"):
        super(JointEncoder, self).__init__(name=name)
        # note: TFBertModel (not the PyTorch BertModel) is needed inside a tf.keras.Model
        self.bert = TFBertModel.from_pretrained(path)

    def call(self, inputs):
        """Returns code and docstring embeddings"""
        ...
        code_embedding = ...
        docstring_embedding = ...
        return code_embedding, docstring_embedding
However, I’m stuck on how to code this simple architecture. Could you give me some directions?
Thanks in advance.
Yes, it does run on Colab!
Maybe @LysandreJik or @sgugger have a link to a notebook?
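For anyone looking for a direction, here is a minimal sketch of how call could be completed (my own attempt, not a solution confirmed by the maintainers in this thread). It assumes the tokenized inputs arrive as two padded integer tensors, that the pad token id is 0 as in bert-base-uncased, and it reuses the cosine_similarity_loss sketched above:
import tensorflow as tf
from transformers import TFBertModel

class JointEncoder(tf.keras.Model):
    """Encodes code and docstring into the same embedding space (sketch)."""

    def __init__(self, path, name="jointencoder"):
        super(JointEncoder, self).__init__(name=name)
        # one shared BERT encoder for both inputs (Siamese setup)
        self.bert = TFBertModel.from_pretrained(path)

    def encode(self, token_ids):
        # assumes pad token id 0; mask out padding positions
        attention_mask = tf.cast(token_ids != 0, tf.int32)
        outputs = self.bert(token_ids, attention_mask=attention_mask)
        # take the hidden state of the [CLS] token as the sequence embedding
        return outputs[0][:, 0, :]

    def call(self, inputs):
        """Returns code and docstring embeddings"""
        code_ids, docstring_ids = inputs
        code_embedding = self.encode(code_ids)
        docstring_embedding = self.encode(docstring_ids)
        return code_embedding, docstring_embedding

# hypothetical training step wiring the model to the cosine loss above
model = JointEncoder("bert-base-uncased")
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)

@tf.function
def train_step(code_ids, docstring_ids):
    with tf.GradientTape() as tape:
        code_emb, doc_emb = model((code_ids, docstring_ids))
        loss = cosine_similarity_loss(code_emb, doc_emb)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
Mean pooling over the non-padding tokens is a common alternative to the [CLS] embedding, and an in-batch softmax over cosine similarities tends to work better for retrieval than the plain pairwise loss.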