Recombinant Host Cell Prediction

Recombinant Protein Expression
Host Cell Prediction

Enter the values below into their respective cells to calculate logit(p), the odds of expression (p/(1-p)), the probability of expression (p), and the preference rank order normalized to E. coli.

Enter Values
Residue Number:
Isoelectric Point:
Hydropathicity:
Instability Index:
Tertiary Structural Class:
Membrane Association:
Quaternary Structural Class:
N-Glycosylation Sites:
Molecular Function:

Results

	logit(p)	Odds (p/(1-p))	p	Rank order vs E. coli
E. coli Expression
Insect Cells Expression
Mammalian Cells Expression
Yeast Cells Expression
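
The relationship between the result columns can be sketched in a few lines of Python. This is only an illustration: the model coefficients are not given on this page, so the logit values below are made up, and the function names are hypothetical. The page also does not spell out how the rank order is normalized to E. coli; dividing each probability by the E. coli probability is an assumption made here.

```python
import math

def logit_to_odds_and_probability(logit_p):
    """Convert logit(p) = ln(p / (1 - p)) into (odds, p)."""
    odds = math.exp(logit_p)      # odds = p / (1 - p)
    p = odds / (1.0 + odds)       # invert the odds to recover p
    return odds, p

def summarize_hosts(logits, reference="E. coli"):
    """Return rows (host, logit, odds, p, p / p_reference), highest p first.

    Dividing by the reference probability is an assumed reading of
    "normalized to E. coli", not the page's documented formula.
    """
    p_ref = logit_to_odds_and_probability(logits[reference])[1]
    rows = []
    for host, logit_p in logits.items():
        odds, p = logit_to_odds_and_probability(logit_p)
        rows.append((host, logit_p, odds, p, p / p_ref))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Made-up logits for illustration only; real values come from the fitted models.
example_logits = {"E. coli": 0.0, "Insect": -2.0, "Mammalian": 1.0, "Yeast": -1.0}
for host, logit_p, odds, p, relative in summarize_hosts(example_logits):
    print(f"{host:10s} logit={logit_p:+.2f} odds={odds:.3f} p={p:.3f} vs E. coli={relative:.2f}")
```

A logit of 0 corresponds to odds of 1 and a probability of 0.5; positive logits favor expression and negative logits disfavor it.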

Introduction

Selecting an appropriate host cell is a critical step when planning recombinant protein expression. Common considerations include expression efficiency, laboratory experience with vector/host construction, prior literature, time, and cost. The desired results are seldom guaranteed at the outset, and many optimization trials, costly in both labor and reagents, are the norm. The prediction offered here leverages the collective experience in recombinant protein expression since its inception in the late 1970s to build predictive models for host cell selection. Evidence-based statistical methods are applied to relate structure-function parameters, stability index, subcellular localization, and post-translational modifications to the preference for expression in particular host cell types. The resulting logit models provide a rational approach for forecasting the preference of specific protein sequences for expression in four host cell types: Escherichia coli, insect, mammalian, and yeast cells. Identifying the correct host at the initial stage will minimize or eliminate the need for downstream optimization using computational tools or costly laboratory trials.

Instructions

Identify the amino acid sequence to be expressed and write it in one-letter code, preferably in FASTA format. Ideally, you should know the entire coding sequence, including the N-terminal Met, the signal peptide, and any internal processing sites. Familiarity with protein databases and prediction web servers is also required to carry out this prediction.

Gather the following nine parameters (predictor variables) for your protein:

1. The total number of amino acid residues (r) for the end-product of expression
2. The corresponding isoelectric point (pI)
3. The GRAVY hydropathicity index (hp)
4. The instability index (ii)
5. The tertiary structural class index (t)
6. The membrane association status (m)
7. The quaternary structural class index (q)
8. The predicted number of N-glycosylation sites (ng)
9. The molecular function (mf)

Here’s how to collect them:

Items 1-4 (r, pI, hp, and ii) can be obtained by submitting your sequence to the ProtParam tool on the ExPASy server.
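
As a sanity check on the ProtParam output, two of these four values are easy to reproduce by hand. The sketch below computes the residue count (r) and the GRAVY index (hp) as the mean Kyte-Doolittle hydropathy over all residues. The function names are hypothetical, and the sketch assumes a single FASTA record containing only the 20 standard one-letter codes. The isoelectric point and instability index are not reimplemented here and should still be taken from ProtParam.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def read_single_fasta(text):
    """Return the sequence from a single-record FASTA string, header dropped."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">")).upper()

def residue_count_and_gravy(sequence):
    """Return (r, hp): residue count and mean Kyte-Doolittle hydropathy."""
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residue codes: {sorted(unknown)}")
    r = len(sequence)
    hp = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / r
    return r, hp

# Hypothetical toy record, for illustration only.
fasta = """>toy_peptide
MKTAYIAKQR
"""
r, hp = residue_count_and_gravy(read_single_fasta(fasta))
print(f"r = {r}, hp = {hp:.3f}")
```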

The remaining items 5 (t), 6 (m), 7 (q), and 9 (mf) can be accurately determined only if the exact sequence can be found in a protein database (e.g., NCBI Reference Proteins (RefSeq protein), UniProtKB/Swiss-Prot (swissprot), PDB). Failing that, a homolog or ortholog with a high degree of amino acid sequence identity (>50%) or homology (>80%) can be used as a substitute. Once the relevant sequence has been identified in the database, look up the corresponding annotations, then code the information numerically as follows:

Item 5, tertiary structural classes (t):

The tertiary structural class (t) can be determined from the annotations of the relevant PDB entries when available, or alternatively by direct visualization using the 3DView module in Mol*. When an experimentally determined structure is not available, the tertiary structural class can be obtained from a predicted structure in the AlphaFold Protein Structure Database for the exact protein sequence or, failing that, for the closest homolog or ortholog. The classification of tertiary structural classes is based on SCOP with minor modifications. Five classes are recognized and numerically coded 1 to 5, respectively:

Small protein (sp): 1
All alpha (α): 2
All beta (β): 3
Alternating alpha beta (α/β): 4
Complex protein (α+β) with multiple domains containing any combination of α, β, α/β, and, on rare occasions, sp: 5

Item 6, membrane association status (m):

Three states of membrane association are recognized and numerically coded 0, 1, and 2, respectively:

Soluble with no known membrane association: 0
An nth order transmembrane protein: 1
A protein without transmembrane helices but annotated as directly associated with a membrane, including via post-translational modification with lipid anchors: 2

In the absence of adequate annotations, membrane association can be determined by predicting transmembrane helices using TMHMM and Phobius, by predicting lipid anchorage sites using GPS-Lipid (for S-palmitoylation, N-myristoylation, S-farnesylation, and S-geranylgeranylation) and big-PI Predictor (for GPI anchors), and by comprehensive subcellular localization prediction using DeepLoc.
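
The coding rule for item 6 can be written as a small decision function. This is a minimal sketch: the function name and its arguments are hypothetical, and the inputs are summaries you would transcribe by hand from the annotations or from the prediction servers named above; nothing here calls those servers.

```python
def membrane_code(n_tm_helices, lipid_anchor_predicted=False,
                  annotated_membrane_associated=False):
    """Return the membrane association code m (0, 1, or 2).

    n_tm_helices: number of transmembrane helices, annotated or predicted
        (for example the count reported by TMHMM or Phobius).
    lipid_anchor_predicted: True if a lipid anchor site is annotated or
        predicted (for example by GPS-Lipid or big-PI Predictor).
    annotated_membrane_associated: True if the database entry describes a
        direct membrane association without transmembrane helices.
    """
    if n_tm_helices < 0:
        raise ValueError("n_tm_helices must be non-negative")
    if n_tm_helices > 0:
        return 1  # transmembrane protein of any order
    if lipid_anchor_predicted or annotated_membrane_associated:
        return 2  # membrane-associated without transmembrane helices
    return 0      # soluble, no known membrane association

print(membrane_code(7))                               # seven-helix transmembrane protein
print(membrane_code(0, lipid_anchor_predicted=True))  # lipid-anchored protein
print(membrane_code(0))                               # soluble protein
```

When the predictors disagree with each other, the choice of input is a judgment call that this sketch does not resolve.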

Item 7, quaternary structural classes (q):

Use the annotations in the PDB and UniProtKB for quaternary structural classification (q). Five categories are recognized and coded 1 to 5, respectively:

Monomer: 1
Homopolymer of nth order: 2
Homopolymer aggregate of undefined size: 3
Heteropolymer of nth order: 4
Heteropolymer aggregate of undefined size: 5

Item 8 (ng) can be obtained by submitting the sequence (including the signal peptide) to the NetNGlyc-1.0 server of DTU Health Tech.
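
Before submitting a sequence, it can help to list the candidate N-glycosylation sequons, that is, Asn-Xaa-Ser/Thr motifs where Xaa is not Pro. The sketch below is not NetNGlyc and does not reproduce its scoring; it only counts candidate motifs, so the server will typically report no more predicted sites than this count. The function name is hypothetical.

```python
import re

# Lookahead so that overlapping motifs (e.g. "NNTS") are each reported.
SEQUON = re.compile(r"(?=N[^P][ST])")

def candidate_sequon_positions(sequence):
    """Return 1-based positions of Asn residues in N-X-S/T motifs (X != Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]

# Hypothetical toy sequence, for illustration only.
toy = "MNGSANPTNNTS"
positions = candidate_sequon_positions(toy)
print(positions, "candidate sequons:", len(positions))
```

In the toy sequence, the motif N-P-T is skipped because proline occupies the middle position.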

Item 9, molecular function (mf):

Use the annotations in UniProtKB to determine the molecular function.

Molecular functions are assigned according to the classification of the Gene Ontology Project, version 2017-09-30, which recognizes 15 categories, numerically coded as shown below:

Antioxidant: 1
Binding: 2
Catalytic: 3
Hijacked molecular function: 4
Molecular carrier activity: 5
Molecular function regulator: 6
Molecular transducer activity: 7
Nutrient reservoir activity: 8
Protein tag: 9
Signal transducer activity: 10
Structural molecule activity: 11
Toxin activity: 12
Transcription regulator activity: 13
Translation regulator activity: 14
Transporter activity: 15

Note that UniProtKB often annotates an entry with multiple functions. Only the core function is used in the modeling. For example, an S-adenosylmethionine transferase would be annotated with three molecular functions: ATP binding, Mg binding, and methionine adenosyltransferase activity. Its molecular function is reported simply as “catalytic” and coded 3, as shown above.
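
The numeric coding of items 5, 6, 7, and 9 can be collected into lookup tables so that a complete set of nine predictor values is assembled in one step. This is a minimal sketch: the codes are those listed in these instructions, but the shorthand dictionary keys, the function name, and the example values are hypothetical and chosen only for illustration.

```python
TERTIARY = {"small protein": 1, "all alpha": 2, "all beta": 3,
            "alpha/beta": 4, "complex": 5}
MEMBRANE = {"soluble": 0, "transmembrane": 1, "membrane-associated": 2}
QUATERNARY = {"monomer": 1, "homopolymer": 2, "homopolymer aggregate": 3,
              "heteropolymer": 4, "heteropolymer aggregate": 5}
MOLECULAR_FUNCTION = {
    "antioxidant": 1, "binding": 2, "catalytic": 3,
    "hijacked molecular function": 4, "molecular carrier activity": 5,
    "molecular function regulator": 6, "molecular transducer activity": 7,
    "nutrient reservoir activity": 8, "protein tag": 9,
    "signal transducer activity": 10, "structural molecule activity": 11,
    "toxin activity": 12, "transcription regulator activity": 13,
    "translation regulator activity": 14, "transporter activity": 15,
}

def encode_predictors(r, pI, hp, ii, tertiary, membrane, quaternary, ng, function):
    """Return the nine predictor values, with categorical labels coded numerically.

    A KeyError is raised if a label is not one of the shorthand keys above.
    """
    return {
        "r": r, "pI": pI, "hp": hp, "ii": ii,
        "t": TERTIARY[tertiary.lower()],
        "m": MEMBRANE[membrane.lower()],
        "q": QUATERNARY[quaternary.lower()],
        "ng": ng,
        "mf": MOLECULAR_FUNCTION[function.lower()],
    }

# Made-up values for an enzyme annotated with ATP binding, Mg binding, and a
# transferase activity: only the core function, "catalytic", is coded.
predictors = encode_predictors(
    r=395, pI=5.9, hp=-0.25, ii=32.1,
    tertiary="complex", membrane="soluble", quaternary="homopolymer",
    ng=0, function="catalytic",
)
print(predictors)
```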

Reference

Forecasting host cells for recombinant protein expression (4/18/2022) Hung V. Le (Open access manuscript)

Help

Go to the inquiry page to submit comments, questions, or requests for help.
