Monika Raman, PSG College of Technology, Coimbatore
The Protein Data Bank (PDB) archive, managed by the Worldwide Protein Data Bank (wwPDB) organization, is the sole global repository of experimentally determined 3D structures of biological macromolecules. The legacy, human-readable PDB file format has been used to communicate PDB structures since the 1970s. However, rapid developments in experimental approaches to structure determination, such as cryo-electron microscopy and integrative/hybrid methods, quickly exposed its limitations. Its successor, the PDBx/Macromolecular Crystallographic Information File format (PDBx/mmCIF), became the master format for the PDB archive in 2014.
Biomacromolecular structural data have outgrown the legacy PDB format on which the scientific community relied for decades, yet its successor, PDBx/mmCIF, is still not widely used. One reason could be the availability of easy-to-use tools that support only the legacy format. Another could be the inherent difficulty of processing mmCIF files accurately, given the large number of edge cases that complicate efficient parsing. To make full use of macromolecular structure data and their associated annotations, however, the new format needs to be widely adopted as soon as possible.
What is the PDBx/mmCIF format?
PDBx/mmCIF is the wwPDB data content definition format, supported by dictionaries and related software tools for PDB entry deposition, annotation, and archiving.
It is an extension of the CIF format, the established standard in small-molecule crystallography. Each file contains one or more data blocks, each introduced by the prefix ‘data_’ and populated with data items. Each data item is identified by a leading underscore followed by a name. The name has two parts, a category and a keyword, separated by a period. There are two types of categories: key-value and tabular. In a key-value category, each keyword holds a single string; in a tabular category, each keyword holds an array of strings.
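The layout described above can be illustrated with a minimal sketch. The sample text, the function name parse_minimal_cif, and the nested-dictionary output are assumptions made for illustration; this is not the PDBeCIF implementation, and it deliberately ignores real-world complications such as multi-line values delimited by semicolons.

```python
import shlex

# Hypothetical, heavily shortened file content used only for illustration.
SAMPLE = """\
data_1ABC
_entry.id 1ABC
_cell.length_a 50.0
#
loop_
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
1 N N
2 C CA
#
"""


def parse_minimal_cif(text):
    """Parse a simplified CIF subset into {block: {category: {keyword: value}}}.

    Key-value categories map each keyword to one string; tabular (loop_)
    categories map each keyword to a list of strings.
    """
    blocks = {}
    block = None
    loop_names = None  # None means "not inside a loop_ section"
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            loop_names = None  # a separator line ends any open loop
        elif line.startswith("data_"):
            block = blocks.setdefault(line[len("data_"):], {})
            loop_names = None
        elif line == "loop_":
            loop_names = []
        elif line.startswith("_"):
            tokens = shlex.split(line)
            # Item name = underscore + category + period + keyword.
            category, keyword = tokens[0][1:].split(".", 1)
            if loop_names is not None and len(tokens) == 1:
                # Column header of a tabular category.
                loop_names.append((category, keyword))
                block.setdefault(category, {})[keyword] = []
            else:
                # Key-value item: name and value on the same line.
                block.setdefault(category, {})[keyword] = tokens[1]
                loop_names = None
        elif loop_names:
            # Data row of a tabular category: one value per column.
            for (category, keyword), value in zip(loop_names, shlex.split(line)):
                block[category][keyword].append(value)
    return blocks


parsed = parse_minimal_cif(SAMPLE)
print(parsed["1ABC"]["entry"]["id"])
print(parsed["1ABC"]["atom_site"]["label_atom_id"])
```

In this sketch the key-value category entry yields a single string per keyword, while the tabular category atom_site yields one list per keyword, mirroring the two category types.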
The PDBx/mmCIF format replaced the PDB file format to remove size limits on submitted structures and substantially enhance the representation of extra data provided with the coordinates.
PDBx/mmCIF files include programmatically available information on structural elements of macromolecular assemblies (category: pdbx_struct_assembly), details on assembly generation (pdbx_struct_assembly_gen), properties and features (pdbx_struct_assembly_prop), and much more. This degree of detail is made possible by the PDBx/mmCIF Exchange Dictionary, which specifies how data item values are validated using data types, controlled dictionaries, and ranges.
The controlled dictionary is implemented in line with the FAIR principles (Findable, Accessible, Interoperable, and Reusable). Nevertheless, several prominent software tools still rely on the legacy PDB format, and even recently developed software may lack support for mmCIF.
Making more software compliant with PDBx/mmCIF would speed up the community's adoption of the new data standard. To facilitate this transition, Glen van Ginkel and colleagues at the European Molecular Biology Laboratory (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK, presented PDBeCIF, a lightweight, general-purpose Python package.
PDBeCIF package
The PDBeCIF package is available for download from PyPI or GitHub. PDBeCIF is a dependency-free Python 2/3 module for manipulating mmCIF/CIF files issued by the wwPDB partners. It supports reading and writing PDBx/mmCIF files, as well as reading CIF files, and provides a number of convenient methods for searching file content.
The package contains several classes.
- CifFileReader – For reading PDBx/mmCIF files.
- CifFileWriter – For writing PDBx/mmCIF files.
- CIFWrapper – A wrapper object that provides Python dot-notation access to the file content and offers search methods for filtering data items by string criteria and regular expressions.
- CifFile data object – Enables simple changes to mmCIF content, such as adding and removing categories and data items.
The parser can also extract only the categories of interest or discard unwanted ones, which further improves parsing speed and memory efficiency.
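Why category filtering saves time and memory can be shown with a small, self-contained sketch. The function names keep_category and filter_cif_lines, the only and ignore parameters, and the sample lines are hypothetical and chosen for illustration; the sketch does not reproduce how PDBeCIF implements filtering internally. The idea is simply to decide per category whether its lines are passed on, so that large tables such as atom_site are never tokenized or stored.

```python
# Hypothetical, heavily shortened file content used only for illustration.
SAMPLE_LINES = [
    "data_1ABC",
    "_entry.id 1ABC",
    "_cell.length_a 50.0",
    "#",
    "loop_",
    "_atom_site.id",
    "_atom_site.type_symbol",
    "1 N",
    "2 C",
    "#",
]


def keep_category(name, only=None, ignore=None):
    """Return True if the category should be kept under the given filters."""
    if only is not None:
        return name in only
    return name not in (ignore or ())


def filter_cif_lines(lines, only=None, ignore=None):
    """Yield only the lines that belong to wanted categories.

    Assumes a simplified layout: item names start with '_', tabular
    categories start with 'loop_', and values never start with '_'.
    """
    keep = True
    pending_loop = False  # saw 'loop_' but not yet its first column name
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("data_"):
            keep, pending_loop = True, False
            yield line
        elif stripped == "loop_":
            pending_loop = True  # emit later, only if the category is kept
        elif stripped.startswith("_"):
            category = stripped[1:].split(".", 1)[0]
            keep = keep_category(category, only, ignore)
            if keep:
                if pending_loop:
                    yield "loop_"
                yield line
            pending_loop = False
        elif keep:
            # Data rows, separators, and continuation lines follow the
            # decision made for the most recent category.
            yield line


for kept_line in filter_cif_lines(SAMPLE_LINES, ignore={"atom_site"}):
    print(kept_line)
```

Because the coordinate table usually dominates the size of a structure file, skipping it before any values are split or stored is where most of the saving would come from in this kind of approach.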
Updated PDBx/mmCIF files containing additional information are available via PDBe. These files add uniform, standardized metadata to the basic PDB archive information, allowing the core Exchange Dictionary to grow further.
PDBeCIF – Performance analysis
“We conducted a performance comparison between PDBeCIF v1.5 and some of the most prominent mmCIF parsers available in Python, such as Biopython v1.78, py-mmcif v0.67, and Atomium v1.0.9,” van Ginkel explained. The running time was measured over seven consecutive runs and the results were averaged.
“For comparisons, we chose a tiny protein (PDB id: 1tqn) and a big molecular machine (PDB id: 7cgo),” he added. In both cases, PDBeCIF was the fastest, with parsing times of 0.3 and 2.28 seconds, respectively. Because PDBeCIF is a purely algorithmic parser that performs no structural interpretation, it is faster than Atomium or Biopython.
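The measurement procedure described above, timing seven consecutive runs and averaging them, can be sketched with the Python standard library. The helper name average_runtime and the stand-in workload count_tokens are hypothetical; the authors' actual benchmarking script is not shown in the source, and the numbers produced here say nothing about any real parser.

```python
import time
from statistics import mean


def average_runtime(func, *args, runs=7):
    """Call func(*args) `runs` times in a row and return the mean wall time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def count_tokens(text):
    """Stand-in workload; a real benchmark would call a parser on a file here."""
    return len(text.split())


sample_text = "_atom_site.id 1 " * 10000
print(f"mean over 7 runs: {average_runtime(count_tokens, sample_text):.6f} s")
```

To compare parsers in the same way, each parser's read function would be passed as func with the same input file, so that every candidate is timed under identical conditions.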
The project is open-source, which supports its continued development and maintenance as a tool for manipulating mmCIF and CIF files. It can be easily integrated into any Python project or used as a format conversion interface between software modules, which should help the PDBx/mmCIF format gain wider acceptance.
It is included in the official wwPDB list of mmCIF parsers and is used widely in processes at the Protein Data Bank in Europe (PDBe). It can easily be combined with third-party libraries and used in a wide range of scientific investigations.
Source: Van Ginkel, G., Pravda, L., Dana, J. M., Varadi, M., Keller, P., Anyango, S., & Velankar, S. (2021a). PDBeCIF: An open-source mmCIF/CIF parsing and processing package. BMC Bioinformatics, 22(1), 383. https://doi.org/10.1186/s12859-021-04271-9
About the author: Monika Raman is a final-year undergraduate student pursuing a B.Tech in Biotechnology. She is an enthusiastic biotechnology student looking for opportunities to develop her skills and grow professionally in research. She is highly motivated and has strong interpersonal skills.