question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

General discrepancies between ProteinNet and mmCIF/PDB files (biopython)

See original GitHub issue

I have some code that fetches mmCIF files for each entry in CASP11 using BioPython. A substantial proportion of examples fail the various checks, though some pass. Many of them are missing the corresponding model, and most that have the given model disagree on primary sequence length. Perhaps there is an obvious explanation for this, and I simply overlooked it. Subtracting one from model_id allows most of the models to be resolved, but many of the primary sequences have significant length mismatch. Most of the files only have the one model, so it’s unclear what is exactly is going on here.

#!/usr/bin/python

# imports
import sys
import re

# Constants
NUM_DIMENSIONS = 3

# Functions for conversion from Mathematica protein files to TFRecords
_aa_dict = {
    'A': '0',
    'C': '1',
    'D': '2',
    'E': '3',
    'F': '4',
    'G': '5',
    'H': '6',
    'I': '7',
    'K': '8',
    'L': '9',
    'M': '10',
    'N': '11',
    'P': '12',
    'Q': '13',
    'R': '14',
    'S': '15',
    'T': '16',
    'V': '17',
    'W': '18',
    'Y': '19'
}
_dssp_dict = {
    'L': '0',
    'H': '1',
    'B': '2',
    'E': '3',
    'G': '4',
    'I': '5',
    'T': '6',
    'S': '7'
}
_mask_dict = {'-': '0', '+': '1'}


class switch(object):
    """Switch statement for Python, based on recipe from Python Cookbook."""
    def __init__(self, value):
        self.value = value
        self.fall = False

    def __iter__(self):
        """Return the match method once, then stop"""
        yield self.match

    def match(self, *args):
        """Indicate whether or not to enter a case suite"""
        if self.fall or not args:
            return True
        elif self.value in args:  # changed for v1.5
            self.fall = True
            return True
        else:
            return False


def letter_to_num(string, dict_):
    """ Convert string of letters to list of ints """
    patt = re.compile('[' + ''.join(dict_.keys()) + ']')
    num_string = patt.sub(lambda m: dict_[m.group(0)] + ' ', string)
    num = [int(i) for i in num_string.split()]
    return num


def read_record(file_, num_evo_entries):
    """ Read a Mathematica protein record from file and convert into dict. """

    dict_ = {}

    while True:
        next_line = file_.readline()
        for case in switch(next_line):
            if case('[ID]' + '\n'):
                id_ = file_.readline()[:-1]
                dict_.update({'id': id_})
            elif case('[PRIMARY]' + '\n'):
                primary = letter_to_num(file_.readline()[:-1], _aa_dict)
                dict_.update({'primary': primary})
            elif case('[EVOLUTIONARY]' + '\n'):
                evolutionary = []
                for residue in range(num_evo_entries):
                    evolutionary.append(
                        [float(step) for step in file_.readline().split()])
                dict_.update({'evolutionary': evolutionary})
            elif case('[SECONDARY]' + '\n'):
                secondary = letter_to_num(file_.readline()[:-1], _dssp_dict)
                dict_.update({'secondary': secondary})
            elif case('[TERTIARY]' + '\n'):
                tertiary = []
                for axis in range(NUM_DIMENSIONS):
                    tertiary.append(
                        [float(coord) for coord in file_.readline().split()])
                dict_.update({'tertiary': tertiary})
            elif case('[MASK]' + '\n'):
                mask = letter_to_num(file_.readline()[:-1], _mask_dict)
                dict_.update({'mask': mask})
            elif case('\n'):
                return dict_
            elif case(''):
                return None


if __name__ == '__main__':
    from Bio.PDB.MMCIFParser import MMCIFParser
    from Bio.PDB.PDBParser import PDBParser
    from Bio.PDB import PDBIO, PDBList
    from Bio.PDB.Structure import Structure, Entity
    from Bio.PDB.Model import Model
    from Bio.PDB.Residue import Residue
    from Bio.PDB.Chain import Chain
    from Bio.PDB.Atom import Atom
    import numpy as np

    input_path = 'D:/casp11/training_30'
    print(f'Reading data from {input_path}')
    num_evo_entries = int(sys.argv[2]) if len(
        sys.argv) == 3 else 20  # default number of evo entries

    input_file = open(input_path, 'r')
    pdbl = PDBList()

    while True:
        record = read_record(input_file, num_evo_entries)
        if record is not None:
            id = record["id"]
            primary = record['primary']
            primary_len = len(primary)
            parts = id.split('_')
            if len(parts) != 3:
                # https://github.com/aqlaboratory/proteinnet/issues/1#issuecomment-375270286
                continue
            pdb_id = parts[0]
            path = pdbl.retrieve_pdb_file(pdb_id, pdir='pdb/', file_format='mmCif')
            parser = MMCIFParser()
            structure = parser.get_structure(pdb_id, path)
            model_id = int(parts[1])
            assert model_id >= 0
            chain_id = parts[2]
            if not model_id in structure.child_dict:
                print(f'{pdb_id} lacks model {model_id}, models: {structure.child_dict}')
                continue
            model = structure.child_dict[model_id]
            assert isinstance(model, Model)
            chain = model.child_dict[chain_id]
            if not chain_id in model.child_dict:
                print(f'{pdb_id} model {model_id} lacks chain {chain_id}, chains: {model.child_dict}')
                continue
            chain_len = len(chain.child_list)
            assert isinstance(chain, Chain)
            if chain_len != primary_len:
                print(f'{pdb_id} chain ({chain_len}) and primary ({primary_len}) lengths mismatch')
            else:
                print(f'{pdb_id} is all good!')
        else:
            input_file.close()
            break

Sampled output:

Reading data from D:/casp11/training_30
Structure exists: 'pdb/2yo0.cif'
2YO0 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4jrn.cif'
4JRN lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4hxf.cif'
4HXF lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2l0e.cif'
2L0E is all good!
Structure exists: 'pdb/2kxi.cif'
2KXI is all good!
Structure exists: 'pdb/2o57.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 8744.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 9031.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 9318.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 9580.
  warnings.warn(
2O57 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4lhr.cif'
4LHR lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3zrg.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1020.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1024.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1027.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1097.
  warnings.warn(
3ZRG lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2k0m.cif'
2K0M is all good!
Structure exists: 'pdb/4hhx.cif'
4HHX lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4ld3.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1362.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1382.
  warnings.warn(
4LD3 lacks model 2, models: {0: <Model id=0>}
Structure exists: 'pdb/3wo6.cif'
3WO6 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2kkp.cif'
2KKP is all good!
Structure exists: 'pdb/4jja.cif'
4JJA lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4h5s.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1565.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1733.
  warnings.warn(
4H5S lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4aez.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 16578.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 16692.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 16725.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 16758.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 16819.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 16820.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain G is discontinuous at line 16868.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain H is discontinuous at line 16889.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 16895.
  warnings.warn(
4AEZ lacks model 3, models: {0: <Model id=0>}
Structure exists: 'pdb/4gdo.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3247.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3295.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3336.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 3362.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 3401.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 3403.
  warnings.warn(
4GDO lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3up6.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 4854.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 4927.
  warnings.warn(
3UP6 lacks model 1, models: {0: <Model id=0>}
Downloading PDB structure '3UKW'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3473.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3717.
  warnings.warn(
3UKW lacks model 2, models: {0: <Model id=0>}
Downloading PDB structure '3SEO'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3510.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3511.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3519.
  warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3539.
  warnings.warn(
3SEO lacks model 1, models: {0: <Model id=0>}

Related issue https://github.com/aqlaboratory/proteinnet/issues/13

Thanks for all the good work.

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:7 (1 by maintainers)

github_iconTop GitHub Comments

2reactions
alquraishicommented, Feb 13, 2020

@lucidrains if you’d like to download the original raw PDBs for CASP12 (just what CASP released, i.e. the ProteinNet test set), you can grab them from the CASP site itself, here.

@thavlik without seeing specific examples of the discrepancy (i.e. a sequence you’re getting versus what ProteinNet has), it’s difficult to tell what the issue is. My best guess is that you’re taking the sequence of the resolved residues in a PDB file–i.e. you’re concatenating a bunch of discontiguous segments. This would be incorrect and non-physical. Virtually all PDB files have missing residues and so the sequence in the PDB file has gaps in it. ProteinNet entries have the full sequences, which is the physical object that actually folds, along with mask fields that indicate which residues are missing from the structure, so that you can correctly map the sequence to the structure and also account for the missing residues in the loss function, etc. I am not 100% sure this is what’s going on, but that’s my best guess like I said. If you can paste a specific sequence from ProteinNet and your pipeline it would be easier to compare the two, and see if you are in fact concatenating discontiguous segments.

1reaction
lucidrainscommented, Mar 5, 2020

@thavlik @alquraishi thank you both!

Read more comments on GitHub >

github_iconTop Results From Across the Web

The Biopython Structural Bioinformatics FAQ
Note however that many PDB files contain headers with incomplete or erroneous information. Many of the errors have been fixed in the equivalent...
Read more >
Biopython Tutorial and Cookbook
2.1 General overview of what Biopython provides; 2.2 Working with sequences; 2.3 A usage example; 2.4 Parsing sequence file formats.
Read more >
The Biopython Structural Bioinformatics FAQ
Then, create a structure object from a PDB file in the following way (the ... In the restrictive state, PDB files with errors...
Read more >
Bio.SeqUtils package — Biopython 1.75 documentation
Calculate G+C content, return percentage (as float between 0 and 100). Copes mixed case sequences, and with the ambiguous nucleotide S (G or...
Read more >
Bio.Seq module — Biopython 1.76 documentation
SeqIO to read in sequences from files as SeqRecord objects, whose sequence will be exposed as a ... we get the more general...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found