Proteomics: An Overview and Analysis of Current and Emerging Technologies, Competitive Landscape and the Market Environment
Found this at bio.com. Looks like a teaser to purchase the full report @ ~$5K. Couldn't find any threatening reproduction warnings, so here it is....
Proteomics aims to study directly the role and function of proteins in tissues and cells. The ultimate goal is to study the interaction of multiple proteins in healthy, diseased, and experimental conditions. Such global studies will enable the investigator to understand the holistic effects of a particular therapeutic agent or experimental intervention. To this end, proteomics requires the ability to separate and isolate all the constituents of a proteome. These pure or near-homogeneous protein isolates must be detectable and in a form conducive to analysis.
All proteomics studies fall into three general categories: differential protein display (including changes in quantity), protein characterization (including post-translational modifications), and protein-protein interaction (including activity assays). All of these demand careful isolation of tissues or cells and proper sample preparation at the outset of the study. The correct choice of starting material, and proper preparation of it, can dictate the success of the whole procedure, because most studies rely on either the purity of a sample or untainted comparisons between two samples.
The very first step is to choose and catalogue the proper crude samples. In basic research, the investigator has great discretion in this matter. For example, in differential protein display between two in vitro tissue culture samples, the treated and untreated cells are easily distinguishable. The experimental and control samples can be chosen based on their identical, or at least similar, pedigree to minimize spurious and irrelevant differences between the two. Theoretically, the only distinctions in the protein profiles of the two samples will be due to the known intervention or therapeutic. The selection of starting material in such a scenario is therefore simple. However, this is not the case for most applications, even in basic science. The target cells are often intermixed with a variety of other cell types or unafflicted cells of the same kind. All proteomics protocols demand the separation of the cells of interest from the rest of the tissue. Otherwise, differential protein display, protein characterization, and protein-protein interaction studies will be tainted with the constituents of the background tissue.
Fluorescence-activated cell sorting (FACS) can be used to separate a subset of suspension cells from the rest of the population. This is especially useful in the study of certain hematopoietic-derived diseases. For example, there are known markers for cancerous leukemia and lymphoma cells that can be used to isolate the afflicted population and compare its protein profile to that of normal leukocytes. FACS can also be used to isolate metastatic cells that are in transit in the circulation. However, detection of cells that have metastasized out of their tissue into the circulation requires great sensitivity.
Laser capture microdissection (LCM) can be used to manually isolate a small number of adherent cells; however, this technique is most suitable for isolation of cells grown in vitro. For clinical studies, microdissection of solid tumors can separate the malignant cells from the normal tissue. This technique is often reserved for solid tumors with clear demarcation of the tumor boundary. However, recent work has shown that LCM can be used to enrich clinical samples such as epithelial tissue from a human cervix specimen. Normal and malignant cells were microdissected from normal and malignant tissue from one cervix (a hysterectomy specimen), and the protein profiles of these cells were compared.
Samples can also be compared between patients. This approach relieves the demand for separation of the afflicted and healthy cells, but complicates the analysis by introducing the differences between individuals. Comparing samples from different healthy individuals yields a number of polymorphisms that are independent of any particular pathology. Therefore, in comparing samples from healthy controls and patients, a large number of samples from many individuals must be compared to reveal only those differences that are related to the condition of interest.
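To make that statistical demand concrete, here is a minimal sketch, not from the report, of how such a screen might run once each gel has been reduced to a vector of matched spot intensities. The function name and data layout are hypothetical, and a crude per-spot two-sample t-test with a Bonferroni correction stands in for whatever statistics a real study would use:

    from scipy import stats

    def differential_spots(healthy, patients, alpha=0.01):
        """Flag spots whose intensity differs between the two groups.

        `healthy` and `patients` are lists of per-gel intensity vectors,
        one entry per matched spot (hypothetical data layout).
        """
        n_spots = len(healthy[0])
        cutoff = alpha / n_spots  # Bonferroni: guard against chance hits over many spots
        hits = []
        for spot in range(n_spots):
            a = [gel[spot] for gel in healthy]
            b = [gel[spot] for gel in patients]
            _, p_value = stats.ttest_ind(a, b)  # two-sample t-test per spot
            if p_value < cutoff:
                hits.append(spot)
        return hits

The correction for multiple testing illustrates the point of the paragraph: with thousands of spots and genuine inter-individual polymorphism, only differences that survive comparison across many individuals can be attributed to the condition itself.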
Care in sample isolation must also be taken in studies focused on protein characterization and protein-protein interaction. For example, the post-translational modification of a particular protein and its subsequent binding to other polypeptides may be altered in a disease, but this change may be very subtle and easily masked by contamination from the healthy surrounding tissue.
Once the correct sample is isolated, the cells or tissue of interest must be efficiently disrupted and the contents of the cells solubilized completely in order to obtain a sample representative of the whole protein population. Physical disruption techniques such as sonication, homogenization, shear-induced lysis, and rapid pressure-change lysis are often used to open the cells prior to protein extraction. Lysis buffers containing detergents, protease inhibitors, and reducing and chaotropic agents facilitate the solubilization of the proteins and increase the stability of the polypeptides. For each cell or tissue type and for each specific application, a specific protocol is often needed to maximize the recovery of cellular proteins.
The extent of recovery of membrane and cytoskeletal proteins can be variable, leaving approximately 10% of the proteins in the insoluble pellet after extraction. This is especially important in studies seeking to understand proteins that reside in compartments that are hard to solubilize. Often, these proteins have evolved hydrophobic domains that facilitate their localization to these subcellular compartments and confer their biological activity. Therefore, these hydrophobic proteins, by their very nature, are difficult to solubilize and retain in the profiling procedure.
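Hydrophobicity of this sort can be estimated directly from sequence. The following minimal sketch, added for illustration and not part of the report, computes the widely used Kyte-Doolittle grand average of hydropathy (GRAVY); a positive score is a rough flag for proteins likely to resist solubilization. The example sequence is an arbitrary hydrophobic stretch, not a real protein:

    # Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982)
    KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
          "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
          "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
          "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

    def gravy(sequence):
        """Grand average of hydropathy: mean Kyte-Doolittle score per residue."""
        return sum(KD[aa] for aa in sequence.upper()) / len(sequence)

    # An arbitrary hydrophobic stretch (illustrative, not a real protein):
    print(round(gravy("LIVFAMWLLAVVILGAIVFL"), 2))  # positive -> hydrophobic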
The lysed sample is often clarified by centrifugation at speeds that pellet large membrane particles and DNA. The presence of nucleic acids, especially DNA, has severe detrimental effects on the separation of proteins by 2-D gel electrophoresis. Under denaturing conditions, such as those used to lyse and homogenize the sample, DNA is dissociated and causes a marked increase in the viscosity of the solution. This inhibits protein entry into the gel and retards migration. Furthermore, DNA binds proteins and causes artifactual migration patterns and streaking. There are two methods for removing DNA from the sample. An endonuclease, which degrades the DNA into very small fragments, can be added to the sample. This is an attractive method for DNA removal because it consists of one quick step that requires very little handling. Another method for DNA removal uses ampholytes that form complexes with DNA molecules; the resulting complexes are subsequently removed by centrifugation. This method carries the risk of losing proteins that interact with the anionic DNA molecules; hence, it must be done at high pH (where most proteins are themselves anionic), which may not be compatible with some protocols.
Before loading the clarified sample on a 2-D gel, some prefractionation can be done to study a particular fraction more carefully. Furthermore, for larger organisms it is impossible to resolve the entire proteome into individual spots on one 2-D gel. Therefore, it is useful to split the original sample between multiple 2-D gels. This approach enhances the sensitivity and resolution of the gels and reveals more information about the subcellular location of the separated proteins.
There are multiple methods for prefractionating the sample before loading it on the 2-D gel. Traditional biochemical separation procedures can be used to isolate subcellular fractions or organelles. These protocols often rely on separating nuclei and unbroken cells from cytoplasmic organelles by differential sedimentation at low centrifugal forces. The remaining supernatant is subjected to various density gradients to isolate specific organelles such as mitochondria or lysosomes. Not only do these prefractionations improve the resolution of the 2-D gels, but they also reveal whether a particular protein is mislocalized in a certain malady.
Preparative liquid isoelectric focusing is another method for prefractionating the original sample. It is noteworthy that although the principles of separation are the same as those used in the first dimension of 2-D gel electrophoresis, preparative isoelectric focusing aims to fractionate the sample into defined pH ranges, not to isolate each protein. This technique concentrates all the proteins of similar isoelectric point into specific fractions that can then be separated from one another on a 2-D gel with a narrow pH gradient.
Affinity chromatography is a more selective tool for prefractionating the sample. Columns using ion exchange, antibodies, or heparin can be used to separate the starting material into smaller and more homogeneous fractions. For example, anti-phosphotyrosine antibodies can be used to isolate all proteins that are phosphorylated on one or more tyrosine residues. In instances where both peptide motifs in an intermolecular interaction are known, one of the two motifs can be coupled to the stationary phase of the column and used as bait to retard all cellular proteins that contain the other motif.
These are just a few of the many possible prefractionation techniques. They all serve to maximize the sensitivity and resolution of the separation protocol while simultaneously maximizing the information yield of the entire procedure. Once the sample has been fractionated to the desired level, it is often loaded onto a 2-D gel electrophoresis apparatus. Although 2-D gel electrophoresis is not the only method for separation of proteins, it is currently the most reliable technique for doing so rapidly, on a large scale, and in parallel. Two-dimensional gel electrophoresis separates proteins based on their isoelectric point in the first dimension, followed by separation based on size in the second dimension. (See the section below for more detail.)
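For illustration, both gel coordinates can be predicted from sequence alone. The sketch below is not part of the report: it estimates the isoelectric point by bisection on the Henderson-Hasselbalch net charge and the molecular weight from average residue masses. The pKa values are one commonly used published set (tables vary), and the example sequence is hypothetical:

    # Average residue masses in daltons; add one water for the intact chain.
    RESIDUE_MASS = {"G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
                    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16,
                    "N": 114.10, "D": 115.09, "Q": 128.13, "K": 128.17,
                    "E": 129.12, "M": 131.19, "H": 137.14, "F": 147.18,
                    "R": 156.19, "Y": 163.18, "W": 186.21}
    WATER = 18.02

    # One commonly used pKa set; published tables differ slightly.
    PKA_POS = {"nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
    PKA_NEG = {"cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

    def net_charge(seq, ph):
        """Henderson-Hasselbalch net charge of a polypeptide at a given pH."""
        charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["nterm"]))
        charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["cterm"] - ph))
        for aa in seq:
            if aa in PKA_POS:
                charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
            elif aa in PKA_NEG:
                charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
        return charge

    def isoelectric_point(seq):
        """Bisect for the pH at which the net charge crosses zero."""
        lo, hi = 0.0, 14.0
        for _ in range(50):
            mid = (lo + hi) / 2.0
            if net_charge(seq, mid) > 0:
                lo = mid  # still positively charged; pI lies above
            else:
                hi = mid
        return round((lo + hi) / 2.0, 2)

    def gel_coordinates(seq):
        """(pI, molecular weight in Da): the two axes of a 2-D gel."""
        return isoelectric_point(seq), sum(RESIDUE_MASS[aa] for aa in seq) + WATER

    print(gel_coordinates("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))  # hypothetical sequence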
After running the 2-D gel in both dimensions, the contents of the gel are often transferred electrophoretically onto a membrane for later analysis. Membranes are more robust and compatible with automation. Proteins are also more stable on membranes and more easily manipulated before analysis. For example, proteins can be electroblotted through a membrane that has a protease covalently bound to it, and the resulting peptides are trapped on a second, hydrophobic membrane. Such a method allows for automated digestion of multiple proteins without contamination and excessive handling. Furthermore, the final hydrophobic membrane can be used directly to analyze the resulting peptides in instruments such as MALDI-TOF mass spectrometers. However, electroblotting large 2-D gels is difficult and often nonquantitative due to the different transfer properties of the proteins and their varying binding affinities for the membrane.
Whether the gel is transferred to a membrane or not, the separated proteins (spots) must be detected. A variety of detection agents such as Coomassie blue, silver, and SYPRO are available. The ideal compound is one that detects all proteins with similar affinity and at very low levels. Coomassie blue is the most consistent, silver staining is the most sensitive, and fluorescent agents such as SYPRO tend to be more compatible with the automation of spot excision.
After the gel has been visualized by staining, it must be scanned and digitized for storage and downstream manipulation. Each spot on the 2-D gel must be defined based on saturation thresholds and defined spot boundaries. The position of each spot is of limited use unless it is in reference to known markers that were included in the loaded sample. The locations of the known markers must be identified, and a warping equation developed to extrapolate the properties of the proteins in each spot. The quality of the image can be enhanced using software to filter out the background and enhance contrast. A reference gel is chosen, and a few best-matched spots are used to compare the migration patterns of the gels. Triangulation using the aforementioned information can extend the matching of spots between gels. The differences can then be analyzed based on changes in the intensity of spots between the gels. The spots of interest can be excised for later manipulation, or all the samples can be treated in the gel. For example, the excised sample or the whole gel can be digested with a protease in preparation for peptide mass fingerprinting.
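A minimal sketch of these digitization steps follows, assuming the scanned gel is available as a 2-D intensity array. The function names and parameters are hypothetical, and commercial packages are far more elaborate:

    import numpy as np
    from scipy import ndimage

    def detect_spots(image, threshold):
        """Return centroids and integrated intensities of gel spots.

        `image`: 2-D array of stain intensities; `threshold`: the
        saturation cutoff that defines a spot boundary.
        """
        smoothed = ndimage.gaussian_filter(image.astype(float), sigma=2)  # filter noise
        mask = smoothed > threshold                    # spot boundaries by thresholding
        labels, n = ndimage.label(mask)                # connected regions = spots
        index = list(range(1, n + 1))
        centroids = ndimage.center_of_mass(smoothed, labels, index)
        intensities = ndimage.sum(smoothed, labels, index)
        return centroids, intensities

    def fit_affine_warp(markers_sample, markers_reference):
        """Least-squares affine warp taking this gel's coordinates onto a
        reference gel, fitted from the matched known-marker spots."""
        src = np.asarray(markers_sample, dtype=float)
        dst = np.asarray(markers_reference, dtype=float)
        design = np.hstack([src, np.ones((len(src), 1))])    # rows: [x, y, 1]
        coef, *_ = np.linalg.lstsq(design, dst, rcond=None)  # 3x2 warp matrix
        return lambda pts: np.hstack([np.asarray(pts, dtype=float),
                                      np.ones((len(pts), 1))]) @ coef

Spots from different gels can then be matched by warping each gel's centroids onto the reference frame and pairing nearest neighbors, after which intensity differences can be compared spot by spot.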
It is noteworthy that a spot in a 2-D gel is rarely made of a single protein species. Given the large number of expressed proteins in most cells and tissues, and the current limitations of the technology, it is unlikely that many proteins migrate to distinct positions on 2-D gels. Loading gels to their very limit to allow detection of low-copy proteins, contaminants that cause streaking, and the very nature of many proteins all cause a great deal of overlapping migration. Therefore, it can be beneficial to fractionate the "separated" samples after 2-D gel electrophoresis. The best approach is to couple a capillary separation technique directly to the final analysis instrument, the mass spectrometer. Capillary zone electrophoresis, high pressure liquid chromatography, and gas chromatography can separate the constituents of each spot into individual proteins and peptides that can be analyzed on a mass spectrometer more easily. However, the advent of more sensitive, high-resolution mass spectrometers and of tandem mass spectrometry has reduced the need for post-2-D gel fractionation.
Although multiple tools for analysis are available, mass spectrometry (MS) is the most widely used and the method that holds the most promise for large-scale studies. Mass spectrometry technology is discussed in more detail later in this report; this section only seeks to highlight the role of MS in the context of the whole process. Initially, data are automatically accumulated and the appropriate spectra selected for data extraction. The first level is very rapid and serves to identify the bulk of the proteins in the sample. This is followed by a more methodical scan to select the ions of interest for further analysis. At this point the mass-to-charge ratio of the proteins can be deciphered. This can be translated to molecular weight using standards and, in conjunction with information obtained from the 2-D gel (such as isoelectric point), can be used to narrow the identification process. A variety of protocols can then be used to obtain more information about the N-terminal amino acid sequence of the proteins or their susceptibility to digestion by known proteases. These studies often require a tandem mass spectrometer, which selects a parent ion in its first analyzer and analyzes the resulting daughter ions in a second. Such approaches are more time consuming but yield high information content. Post-translational modifications can be studied by either the systematic digestion or removal of the modifications, or by daughter-ion scanning.
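The translation from mass-to-charge ratio to molecular weight mentioned above is simple arithmetic once the charge state is known. The sketch below shows the standard relations for protonated ions in an electrospray spectrum, where adjacent peaks from the same molecule differ by one charge; the peak values in the example are hypothetical:

    PROTON = 1.007276  # mass of a proton in daltons

    def neutral_mass(mz, z):
        """Neutral molecular mass from an observed m/z and its charge state:
        m/z = (M + z * PROTON) / z, so M = z * (m/z - PROTON)."""
        return z * (mz - PROTON)

    def charge_from_adjacent_peaks(mz_low, mz_high):
        """Charge of the higher-m/z peak, inferred from two adjacent
        charge states of the same molecule."""
        return round((mz_low - PROTON) / (mz_high - mz_low))

    # Hypothetical adjacent peaks from one protein:
    z = charge_from_adjacent_peaks(893.2, 942.8)   # -> 18
    print(neutral_mass(942.8, z))                  # -> roughly 16,950 Da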
At the conclusion of this procedure, most of the proteins and peptides are separated into near-pure fractions and analyzed. The information about each spot, or each peptide in a particular spot, is usually limited at first. It often consists of one or more of the following: a peptide mass fingerprint, a short N-terminal sequence, amino acid composition, molecular weight, or isoelectric point. This information is then compared to the sequences of all putative proteins encoded by the genome of that organism. Recent advances in software allow the behavior of these virtual proteins under experimental conditions to be predicted, so the experimental data can be compared to the theoretical information to find a match. Here lies the connection between proteomics and genomics, because it is rarely possible or economical to sequence proteins in their entirety.
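As an illustration of this matching step, here is a minimal peptide-mass-fingerprinting sketch, not the report's method: each candidate database sequence is digested in silico with trypsin's cleavage rule and scored by how many of the observed peptide masses its theoretical digest explains within a mass tolerance.

    import re

    # Monoisotopic residue masses in daltons; a peptide is the sum plus one water.
    MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
            "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
            "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
            "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
            "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
    WATER = 18.010565

    def tryptic_peptides(protein):
        """Cleave after K or R, except when followed by P (trypsin's rule)."""
        return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

    def peptide_mass(peptide):
        return sum(MONO[aa] for aa in peptide) + WATER

    def fingerprint_score(observed_masses, protein, tol=0.2):
        """Number of observed masses explained by the candidate's digest,
        within `tol` daltons."""
        theoretical = [peptide_mass(p) for p in tryptic_peptides(protein)]
        return sum(any(abs(obs - th) <= tol for th in theoretical)
                   for obs in observed_masses)

Ranking every database entry by this score, and breaking ties with the gel-derived molecular weight and isoelectric point, is the essence of fingerprint-based identification.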
Since identification of proteins is heavily dependent on genomics databases, it is appropriate to review the progress of genome projects in various species. The automation of high throughput DNA sequencing has changed the landscape of biological research. The traditional approach of characterizing a single gene at a time has given way to systematically identifying the information content of an organism. The genome of the simple bacteriophage ΦX174 was the first to be completely sequenced, in the late 1970s. Haemophilus influenzae was the first free-living organism to be sequenced in its entirety, in 1995. The genome of Escherichia coli was sequenced in 1997, and now many prokaryotes have been fully characterized. Amongst these organisms are a number of human pathogens, such as Helicobacter pylori, associated with ulcers and gastritis, and Rickettsia prowazekii, which causes typhus.
Genome projects have also been completed in eukaryotic organisms. The first eukaryotic genome to be fully sequenced, that of the yeast Saccharomyces cerevisiae, was completed in 1996, and the genome of a multicellular eukaryote, the nematode worm Caenorhabditis elegans, was completely sequenced in late 1998. The genome projects of the fruit fly Drosophila melanogaster, the plant Arabidopsis thaliana, and several other organisms are near completion. In 1998, ongoing genome projects were reported for over 70 prokaryotic organisms, and there are over 20 reported programs to sequence eukaryotic model organisms.
Since a large fraction of eukaryotic DNA does not code for protein, the initial efforts in the genome projects for larger eukaryotic organisms have been focused primarily on sequencing the segments of the genome that code for proteins (genes). Therefore, a great deal of information is available about potential coding sequences of eukaryotes. This is especially true for human genes, which account for nearly three quarters of all expressed sequence tags (ESTs). ESTs are short sequences obtained by random priming from human cDNA libraries; sequences near the 5' end or the 3' poly(A) tail of the cDNA are amplified and sequenced. Using the sequence information of the coding region, primers can be designed that allow the mapping of that particular gene on a chromosome. Effectively, an EST database is a catalogue of all potential genes in that particular organism. Although EST information can aid the identification of proteins only if the short EST sequence overlaps with the partial sequence of the peptide, these databases are already very potent tools for linking proteins to their genes.
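A minimal sketch of that overlap test, illustrative and not from the report: translate an EST in all six reading frames and check whether the partial peptide sequence appears in any frame.

    # Standard genetic code, built from the TCAG codon ordering.
    BASES = "TCAG"
    AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
             for i, a in enumerate(BASES)
             for j, b in enumerate(BASES)
             for k, c in enumerate(BASES)}

    def reverse_complement(dna):
        return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def six_frame_translations(dna):
        """Yield the protein translation of every reading frame, both strands."""
        for strand in (dna, reverse_complement(dna)):
            for offset in range(3):
                yield "".join(CODON.get(strand[i:i + 3], "X")
                              for i in range(offset, len(strand) - 2, 3))

    def est_supports_peptide(est, peptide):
        """True if the partial peptide sequence occurs in any frame of the EST."""
        return any(peptide in frame for frame in six_frame_translations(est.upper()))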
Public human EST databases contain over 1,200,000 entries, 350,000 of which are considered to be non-redundant. The power of EST databases is best illustrated by noting that although only 3% of the human genome had been sequenced by the end of 1998, over 50% of all human genes are thought to be represented in public EST databases. It is likely that private EST databases cover as many as 80% of all human genes. Furthermore, considering that the human genome is expected to contain about 100,000 genes, many of the 350,000 unique ESTs representing 50% of all human genes must belong to different parts of the same gene (on average, roughly seven tags per represented gene). Therefore, a large fraction of the coding sequence in the human genome is covered by 250-400 base pair stretches of multiple expressed sequence tags to different parts of the same genes. This coverage makes identification of proteins based on partial peptide sequencing and EST information more probable.
At this point, a positive match between the experimental sample and a putative protein can be used to identify the protein in that sample. However, in organisms with incomplete genome projects, a positive hit often fails to eliminate the possibility that a yet unidentified gene codes for the isolated protein. The criteria used to find a match are rarely comprehensive; therefore, it is possible for two proteins to satisfy certain criteria equally well because they share some homology over one or two domains. These proteins can be very different overall, but match the same gene based on limited-criteria matching. Of course, the absence of a match in such an organism will also fail to reveal any information about the protein of interest. Currently, the only method of studying such a protein is through traditional protocols such as partial peptide sequencing and screening of cDNA libraries for the corresponding DNA code.
The importance of bioinformatics has become ever more obvious as proteomic and genomic studies have become more efficient at data generation. This is especially true for high throughput protein analysis, where the quantity of data is beyond manual deciphering. In the case of HPLC- or capillary zone electrophoresis (CZE)-coupled tandem mass spectrometry, the resulting data files are enormous, and very little can be interpreted by manual viewing.
===========================================================
In the interest of fairness, here's a link to a plug for their full report: bio.com