Semantic Code Retrieval using Transformers


I am entering the world of transformers and would like to use some architectures to create a semantic search engine to retrieve source code (Python, JavaScript, Ruby, Go, Java, and PHP).

Currently, the dataset contains 2 million pairs (code, docstring), where code is a list of tokens from a method or function and docstring is a short description of the code in natural language.

As a starting point, it would be interesting to construct a model architecture that receives the code and the docstring ([ [code], [docstring] ]) as an input example and outputs the code embedding and the docstring embedding. Using cosine similarity as the loss function, the model could be fine-tuned to encode both code and docstring into the same embedding space, as shown in the figure below:

            

I started reading and tokenizing the dataset:

    from transformers import BertTokenizer
    # reads a list of [[code], [docstring]]
    reader = CodeDocstringReader(dataset_path)
    
    # loads tokenizer
    model_name = "bert-base-uncased"
    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)

    # returns a list of tokenized examples
    # [[code_token_ids], [docstring_token_ids]]
    tokenized_features = tokenizer_examples(
        reader.get_examples(),
        tokenizer
    )
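`tokenizer_examples` is not defined in the post. A minimal sketch of what such a helper might look like follows, assuming each example is a `[code, docstring]` pair whose elements are either token lists or plain strings, and that the tokenizer exposes the `encode` method of Hugging Face tokenizers. The `max_length` parameter and its default of 128 are assumptions chosen for illustration, not values from the original issue.

```python
def tokenizer_examples(examples, tokenizer, max_length=128):
    """Hypothetical helper: encode [code, docstring] pairs into token IDs.

    Returns a list of [[code_token_ids], [docstring_token_ids]] entries,
    each padded or truncated to max_length (an assumed value).
    """

    def to_text(item):
        # The dataset stores code as a list of tokens; accept plain strings too.
        return item if isinstance(item, str) else " ".join(item)

    features = []
    for code, docstring in examples:
        pair = [
            tokenizer.encode(
                to_text(part),
                max_length=max_length,
                truncation=True,
                padding="max_length",
            )
            for part in (code, docstring)
        ]
        features.append(pair)
    return features
```

Padding every sequence to the same length keeps the features rectangular, so they can later be stacked into a single batch tensor.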

The definition and training of the model are still incomplete, but it is outlined as:

    import tensorflow as tf
    # TFBertModel is the TensorFlow class; BertModel is the PyTorch one
    # and cannot be used inside a tf.keras.Model.
    from transformers import TFBertModel


    class JointEncoder(tf.keras.Model):
        """Encodes the code and docstring into the same embedding space."""

        def __init__(self, path, name="jointencoder"):
            super().__init__(name=name)
            self.bert = TFBertModel.from_pretrained(path)

        def call(self, inputs):
            """Returns code and docstring embeddings."""
            ...
            code_embedding = ...
            docstring_embedding = ...

            return code_embedding, docstring_embedding

However, I’m stuck on how to code this simple architecture. Could you give me some directions?

Thanks in advance.
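
One possible direction for the `call` method and the loss, sketched without any framework so the arithmetic is visible: pool each sequence of token vectors into a single embedding, then penalize low cosine similarity between the code and docstring embeddings. Mean pooling over non-padding tokens, the `1 - cosine` form of the loss, and all function names here are assumptions for illustration; the post does not specify them. In a real model the token vectors would come from the BERT encoder's output rather than hand-written lists.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding sequence.
    return [value / max(count, 1) for value in total]


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two vectors, safe for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def cosine_loss(code_embedding, docstring_embedding):
    """0 when the embeddings point the same way, 2 when they are opposite."""
    return 1.0 - cosine_similarity(code_embedding, docstring_embedding)


# Toy token-level outputs: 3 positions, 2 dimensions, last position is padding.
code_tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
code_mask = [1, 1, 0]
doc_tokens = [[0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
doc_mask = [1, 1, 0]

code_embedding = mean_pool(code_tokens, code_mask)       # [2.0, 0.0]
docstring_embedding = mean_pool(doc_tokens, doc_mask)    # [0.0, 3.0]
loss = cosine_loss(code_embedding, docstring_embedding)  # orthogonal -> 1.0
```

Minimizing this loss on matching pairs alone only pulls them together. Training for retrieval typically also needs non-matching (code, docstring) pairs as negatives, since otherwise mapping every input to the same vector would already give zero loss.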

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Oct 14, 2020

Yes it does run on colab!

1 reaction
julien-c commented, Oct 13, 2020

maybe @LysandreJik or @sgugger have a link to a notebook?


