General discrepancies between ProteinNet and mmCIF/PDB files (biopython)
See original GitHub issueI have some code that fetches mmCIF files for each entry in CASP11 using BioPython. A substantial proportion of examples fail the various checks, though some pass. Many of them are missing the corresponding model, and most that have the given model disagree on primary sequence length. Perhaps there is an obvious explanation for this, and I simply overlooked it. Subtracting one from model_id allows most of the models to be resolved, but many of the primary sequences have significant length mismatch. Most of the files only have the one model, so it’s unclear what is exactly is going on here.
#!/usr/bin/python
# imports
import sys
import re
# Constants
NUM_DIMENSIONS = 3
# Functions for conversion from Mathematica protein files to TFRecords
_aa_dict = {
'A': '0',
'C': '1',
'D': '2',
'E': '3',
'F': '4',
'G': '5',
'H': '6',
'I': '7',
'K': '8',
'L': '9',
'M': '10',
'N': '11',
'P': '12',
'Q': '13',
'R': '14',
'S': '15',
'T': '16',
'V': '17',
'W': '18',
'Y': '19'
}
_dssp_dict = {
'L': '0',
'H': '1',
'B': '2',
'E': '3',
'G': '4',
'I': '5',
'T': '6',
'S': '7'
}
_mask_dict = {'-': '0', '+': '1'}
class switch(object):
"""Switch statement for Python, based on recipe from Python Cookbook."""
def __init__(self, value):
self.value = value
self.fall = False
def __iter__(self):
"""Return the match method once, then stop"""
yield self.match
def match(self, *args):
"""Indicate whether or not to enter a case suite"""
if self.fall or not args:
return True
elif self.value in args: # changed for v1.5
self.fall = True
return True
else:
return False
def letter_to_num(string, dict_):
""" Convert string of letters to list of ints """
patt = re.compile('[' + ''.join(dict_.keys()) + ']')
num_string = patt.sub(lambda m: dict_[m.group(0)] + ' ', string)
num = [int(i) for i in num_string.split()]
return num
def read_record(file_, num_evo_entries):
""" Read a Mathematica protein record from file and convert into dict. """
dict_ = {}
while True:
next_line = file_.readline()
for case in switch(next_line):
if case('[ID]' + '\n'):
id_ = file_.readline()[:-1]
dict_.update({'id': id_})
elif case('[PRIMARY]' + '\n'):
primary = letter_to_num(file_.readline()[:-1], _aa_dict)
dict_.update({'primary': primary})
elif case('[EVOLUTIONARY]' + '\n'):
evolutionary = []
for residue in range(num_evo_entries):
evolutionary.append(
[float(step) for step in file_.readline().split()])
dict_.update({'evolutionary': evolutionary})
elif case('[SECONDARY]' + '\n'):
secondary = letter_to_num(file_.readline()[:-1], _dssp_dict)
dict_.update({'secondary': secondary})
elif case('[TERTIARY]' + '\n'):
tertiary = []
for axis in range(NUM_DIMENSIONS):
tertiary.append(
[float(coord) for coord in file_.readline().split()])
dict_.update({'tertiary': tertiary})
elif case('[MASK]' + '\n'):
mask = letter_to_num(file_.readline()[:-1], _mask_dict)
dict_.update({'mask': mask})
elif case('\n'):
return dict_
elif case(''):
return None
if __name__ == '__main__':
from Bio.PDB.MMCIFParser import MMCIFParser
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB import PDBIO, PDBList
from Bio.PDB.Structure import Structure, Entity
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Chain import Chain
from Bio.PDB.Atom import Atom
import numpy as np
input_path = 'D:/casp11/training_30'
print(f'Reading data from {input_path}')
num_evo_entries = int(sys.argv[2]) if len(
sys.argv) == 3 else 20 # default number of evo entries
input_file = open(input_path, 'r')
pdbl = PDBList()
while True:
record = read_record(input_file, num_evo_entries)
if record is not None:
id = record["id"]
primary = record['primary']
primary_len = len(primary)
parts = id.split('_')
if len(parts) != 3:
# https://github.com/aqlaboratory/proteinnet/issues/1#issuecomment-375270286
continue
pdb_id = parts[0]
path = pdbl.retrieve_pdb_file(pdb_id, pdir='pdb/', file_format='mmCif')
parser = MMCIFParser()
structure = parser.get_structure(pdb_id, path)
model_id = int(parts[1])
assert model_id >= 0
chain_id = parts[2]
if not model_id in structure.child_dict:
print(f'{pdb_id} lacks model {model_id}, models: {structure.child_dict}')
continue
model = structure.child_dict[model_id]
assert isinstance(model, Model)
chain = model.child_dict[chain_id]
if not chain_id in model.child_dict:
print(f'{pdb_id} model {model_id} lacks chain {chain_id}, chains: {model.child_dict}')
continue
chain_len = len(chain.child_list)
assert isinstance(chain, Chain)
if chain_len != primary_len:
print(f'{pdb_id} chain ({chain_len}) and primary ({primary_len}) lengths mismatch')
else:
print(f'{pdb_id} is all good!')
else:
input_file.close()
break
Sampled output:
Reading data from D:/casp11/training_30
Structure exists: 'pdb/2yo0.cif'
2YO0 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4jrn.cif'
4JRN lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4hxf.cif'
4HXF lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2l0e.cif'
2L0E is all good!
Structure exists: 'pdb/2kxi.cif'
2KXI is all good!
Structure exists: 'pdb/2o57.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 8744.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 9031.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 9318.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 9580.
warnings.warn(
2O57 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4lhr.cif'
4LHR lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3zrg.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1020.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1024.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1027.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1097.
warnings.warn(
3ZRG lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2k0m.cif'
2K0M is all good!
Structure exists: 'pdb/4hhx.cif'
4HHX lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4ld3.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1362.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1382.
warnings.warn(
4LD3 lacks model 2, models: {0: <Model id=0>}
Structure exists: 'pdb/3wo6.cif'
3WO6 lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/2kkp.cif'
2KKP is all good!
Structure exists: 'pdb/4jja.cif'
4JJA lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4h5s.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 1565.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 1733.
warnings.warn(
4H5S lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/4aez.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 16578.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 16692.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 16725.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 16758.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 16819.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 16820.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain G is discontinuous at line 16868.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain H is discontinuous at line 16889.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain I is discontinuous at line 16895.
warnings.warn(
4AEZ lacks model 3, models: {0: <Model id=0>}
Structure exists: 'pdb/4gdo.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3247.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3295.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3336.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain D is discontinuous at line 3362.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain E is discontinuous at line 3401.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain F is discontinuous at line 3403.
warnings.warn(
4GDO lacks model 1, models: {0: <Model id=0>}
Structure exists: 'pdb/3up6.cif'
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 4854.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 4927.
warnings.warn(
3UP6 lacks model 1, models: {0: <Model id=0>}
Downloading PDB structure '3UKW'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3473.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain C is discontinuous at line 3717.
warnings.warn(
3UKW lacks model 2, models: {0: <Model id=0>}
Downloading PDB structure '3SEO'...
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3510.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3511.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3519.
warnings.warn(
C:\Users\tlhavlik\AppData\Local\Programs\Python\Python38\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3539.
warnings.warn(
3SEO lacks model 1, models: {0: <Model id=0>}
Related issue https://github.com/aqlaboratory/proteinnet/issues/13
Thanks for all the good work.
Issue Analytics
- State:
- Created 4 years ago
- Comments:7 (1 by maintainers)
Top Results From Across the Web
The Biopython Structural Bioinformatics FAQ
Note however that many PDB files contain headers with incomplete or erroneous information. Many of the errors have been fixed in the equivalent...
Read more >Biopython Tutorial and Cookbook
2.1 General overview of what Biopython provides; 2.2 Working with sequences; 2.3 A usage example; 2.4 Parsing sequence file formats.
Read more >The Biopython Structural Bioinformatics FAQ
Then, create a structure object from a PDB file in the following way (the ... In the restrictive state, PDB files with errors...
Read more >Bio.SeqUtils package — Biopython 1.75 documentation
Calculate G+C content, return percentage (as float between 0 and 100). Copes mixed case sequences, and with the ambiguous nucleotide S (G or...
Read more >Bio.Seq module — Biopython 1.76 documentation
SeqIO to read in sequences from files as SeqRecord objects, whose sequence will be exposed as a ... we get the more general...
Read more >
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
@lucidrains if you’d like to download the original raw PDBs for CASP12 (just what CASP released, i.e. the ProteinNet test set), you can grab them from the CASP site itself, here.
@thavlik without seeing specific examples of the discrepancy (i.e. a sequence you’re getting versus what ProteinNet has), it’s difficult to tell what the issue is. My best guess is that you’re taking the sequence of the resolved residues in a PDB file–i.e. you’re concatenating a bunch of discontiguous segments. This would be incorrect and non-physical. Virtually all PDB files have missing residues and so the sequence in the PDB file has gaps in it. ProteinNet entries have the full sequences, which is the physical object that actually folds, along with mask fields that indicate which residues are missing from the structure, so that you can correctly map the sequence to the structure and also account for the missing residues in the loss function, etc. I am not 100% sure this is what’s going on, but that’s my best guess like I said. If you can paste a specific sequence from ProteinNet and your pipeline it would be easier to compare the two, and see if you are in fact concatenating discontiguous segments.
@thavlik @alquraishi thank you both!