Recombinant Protein Expression
Host | logit(p) | Odds (p/(1-p)) | p | Rank order vs E. coli
E. coli Expression | | | |
Insect Cells Expression | | | |
Mammalian Cells Expression | | | |
Yeast Cells Expression | | | |
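The quantities reported for each host (logit(p), odds and p) are linked by the standard logistic identities: odds = exp(logit(p)) and p = odds / (1 + odds). The following is a minimal sketch of those conversions only; the function names are illustrative and are not part of the tool, and the model that produces logit(p) is not reproduced here.

```python
import math


def logit_to_odds(logit: float) -> float:
    """Odds p/(1-p) corresponding to a given logit(p)."""
    return math.exp(logit)


def logit_to_probability(logit: float) -> float:
    """Probability p corresponding to a given logit(p).

    The two branches avoid overflow in exp() for large |logit|.
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def probability_to_logit(p: float) -> float:
    """Inverse transform: logit(p) = ln(p / (1 - p)), for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))
```

A logit of 0 corresponds to odds of 1 and p = 0.5; positive logits mean p > 0.5.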
Introduction
Instructions
Identify the amino acid sequence to be expressed and write it in one-letter code, preferably in FASTA format. Ideally, you should know the entire coding sequence, including the N-terminal Met, the signal peptide, and any internal processing sites. Some familiarity with protein databases and prediction web servers is also needed to carry out this prediction.
Gather the following nine parameters (predictor variables) for your protein:
1. The total number of amino acid residues (r) for the end-product of expression
2. The corresponding isoelectric point (pI)
3. The GRAVY hydropathicity index (hp)
4. The instability index (ii)
5. The tertiary structural class index (t)
6. The membrane association status (m)
7. The quaternary structural class index (q)
8. The predicted number of N-glycosylation sites (ng)
9. The molecular function (mf)
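Before submitting a sequence to any server, it can help to normalize the FASTA text into a single one-letter string and flag unexpected characters. The sketch below is one way to do that with the Python standard library; the function names and the choice to read only the first record are assumptions made for illustration.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta_sequence(text: str) -> str:
    """Return the first sequence in FASTA text as one upper-case string."""
    residues = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A second header marks the start of another record: stop there.
            if seen_header or residues:
                break
            seen_header = True
            continue
        residues.append(line.upper())
    return "".join(residues)


def nonstandard_residues(sequence: str) -> list:
    """Sorted list of characters that are not one of the 20 standard residues."""
    return sorted(set(sequence) - STANDARD_RESIDUES)
```

An empty list from `nonstandard_residues` means the sequence contains only the 20 standard one-letter codes.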
Here’s how to collect them:
Items 1-4 (r, pI, hp and ii) can be obtained by submitting your sequence to the ProtParam tool on the ExPASy server.
Items 5 (t), 6 (m), 7 (q) and 9 (mf) can only be determined accurately if the exact sequence can be identified in a protein database (e.g., NCBI Reference Proteins (RefSeq protein), UniProtKB/Swiss-Prot (swissprot), PDB). Failing that, a homolog or ortholog with a high degree of amino acid sequence identity (>50%) or homology (>80%) can be used as a substitute. Once the relevant sequence has been identified in the database, look up the corresponding annotations, then code the information numerically as follows:
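As a local cross-check on two of these values, r and hp can be recomputed directly from the sequence. The sketch below assumes that GRAVY is the mean Kyte-Doolittle hydropathy value over all residues; the function names are illustrative. The isoelectric point and the instability index rely on larger lookup tables and are best taken from ProtParam itself.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def residue_count(sequence: str) -> int:
    """Predictor r: number of residues in the expressed end-product."""
    return len(sequence)


def gravy(sequence: str) -> float:
    """Predictor hp: mean hydropathy over all residues.

    Raises KeyError for non-standard residue codes and ValueError
    for an empty sequence.
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)
```

Negative values of hp indicate an overall hydrophilic sequence, positive values an overall hydrophobic one.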
Item 5, tertiary structural classes (t):
The tertiary structural class (t) can be determined from the annotations of the relevant PDB entries when available, or alternatively by direct visualization using the 3DView module in Mol*. When no experimentally determined structure is available, the class can be assigned from a predicted structure in the AlphaFold Protein Structure Database, either for the exact protein sequence or, failing that, for the closest homolog or ortholog. The classification is based on SCOP with minor modifications. Five classes are recognized and numerically coded 1 to 5:
Small protein (sp): 1
All alpha (α): 2
All beta (β): 3
Alternating alpha beta (α/β): 4
Complex protein (α+β) with multiple domains containing any combination of α, β, α/β and, on rare occasions, sp: 5
Item 6, membrane association status (m):
Three states of membrane association are recognized and numerically coded 0, 1 and 2:
Soluble with no known membrane association: 0
An nth order transmembrane protein: 1
A protein without transmembrane helices but annotated as directly associated with a membrane, including via post-translational modification with lipid anchors: 2
In the absence of adequate annotations, membrane association can be determined by predicting transmembrane helices with TMHMM and Phobius, by predicting lipid anchorage sites with GPS-Lipid (for S-palmitoylation, N-myristoylation, S-farnesylation and S-geranylgeranylation) and big-PI Predictor (for GPI anchors), and by comprehensive subcellular localization with DeepLoc.
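The three-state coding above can be written as a small decision rule once the annotations or predictions are in hand. The sketch below is one reading of that rule; the function name and its parameters are hypothetical, and the inputs are assumed to have been collected by hand from the annotations or prediction servers named above, not fetched by the code.

```python
def membrane_status(
    n_tm_helices: int,
    lipid_anchor: bool = False,
    membrane_annotated: bool = False,
) -> int:
    """Code predictor m as 0, 1 or 2.

    n_tm_helices       -- annotated or predicted transmembrane helix count
    lipid_anchor       -- True if a lipid anchor is annotated or predicted
    membrane_annotated -- True if annotated as directly membrane-associated
    """
    if n_tm_helices < 0:
        raise ValueError("n_tm_helices cannot be negative")
    if n_tm_helices >= 1:
        return 1  # transmembrane protein of any order
    if lipid_anchor or membrane_annotated:
        return 2  # membrane-associated without transmembrane helices
    return 0      # soluble, no known membrane association
```

Transmembrane helices take precedence here: a protein with both helices and a lipid anchor is coded 1, which is an assumption about how overlapping cases are resolved.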
Item 7, quaternary structural classes (q):
Use the annotations in the PDB and UniProtKB for quaternary structural classification (q). Five categories are recognized and coded 1 to 5:
Monomer: 1
Homopolymer of nth order: 2
Homopolymer aggregate of undefined size: 3
Heteropolymer of nth order: 4
Heteropolymer aggregate of undefined size: 5
Item 8 (ng) can be obtained by submitting the sequence (including the signal peptide) to the NetNGlyc-1.0 server of DTU Health Tech.
Item 9, molecular function (mf):
Use the annotations in UniProtKB to determine the molecular function.
Molecular functions are assigned according to the classification of the Gene Ontology Project, version 2017-09-30, which recognizes 15 categories, numerically coded as follows:
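For a quick sanity check on the NetNGlyc result, the candidate N-glycosylation sequons (N-X-S/T, where X is any residue except proline) can be listed locally. This is not the NetNGlyc method: the server predicts which sequons are likely to be glycosylated, so the count below is only an upper bound on ng under that sequon definition. The function name is illustrative.

```python
import re

# Lookahead so that overlapping sequons (e.g. "NNTS") are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(sequence: str) -> list:
    """1-based positions of the Asn in each N-X-S/T sequon (X != Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]
```

If NetNGlyc reports more sites than `len(sequon_positions(seq))`, the two inputs probably differ, for example because the signal peptide was omitted from one of them.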
Antioxidant: 1
Binding: 2
Catalytic: 3
Hijacked molecular function: 4
Molecular carrier activity: 5
Molecular function regulator: 6
Molecular transducer activity: 7
Nutrient reservoir activity: 8
Protein tag: 9
Signal transducer activity: 10
Structural molecule activity: 11
Toxin activity: 12
Transcription regulator activity: 13
Translation regulator activity: 14
Transporter activity: 15
Note that UniProtKB often annotates an entry with multiple functions. Only the core function is recognized in the modeling. For example, an S-adenosylmethionine synthetase may be annotated with three molecular functions: ATP binding, Mg binding and methionine adenosyltransferase activity. Its molecular function is simply reported as “catalytic” and coded 3, as shown above.
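Once all nine values are collected, it is easy to mistype a class code. The sketch below stores the predictors in one record and checks each coded variable against the ranges given in these instructions; the class name, field layout and the 0-14 bound on pI are assumptions made for illustration, not part of the tool.

```python
from dataclasses import dataclass

MOLECULAR_FUNCTION_CODES = {
    "Antioxidant": 1, "Binding": 2, "Catalytic": 3,
    "Hijacked molecular function": 4, "Molecular carrier activity": 5,
    "Molecular function regulator": 6, "Molecular transducer activity": 7,
    "Nutrient reservoir activity": 8, "Protein tag": 9,
    "Signal transducer activity": 10, "Structural molecule activity": 11,
    "Toxin activity": 12, "Transcription regulator activity": 13,
    "Translation regulator activity": 14, "Transporter activity": 15,
}


@dataclass(frozen=True)
class Predictors:
    r: int      # number of residues
    pI: float   # isoelectric point
    hp: float   # GRAVY hydropathicity index
    ii: float   # instability index
    t: int      # tertiary structural class, 1-5
    m: int      # membrane association status, 0-2
    q: int      # quaternary structural class, 1-5
    ng: int     # predicted number of N-glycosylation sites
    mf: int     # molecular function, 1-15

    def __post_init__(self):
        if self.r < 1:
            raise ValueError("r must be a positive residue count")
        if not 0.0 <= self.pI <= 14.0:
            raise ValueError("pI is expected to lie between 0 and 14")
        if self.ng < 0:
            raise ValueError("ng cannot be negative")
        # Coded variables and their allowed integer ranges.
        for name, low, high in (("t", 1, 5), ("m", 0, 2), ("q", 1, 5), ("mf", 1, 15)):
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")
```

For the example above, the core function would be entered as `mf=MOLECULAR_FUNCTION_CODES["Catalytic"]`, which is 3.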
Reference
Hung V. Le. Forecasting host cells for recombinant protein expression. Open access manuscript, 4/18/2022.
Help
Go to the inquiry page to submit comments, questions, requests for help.