(in bioinformatics) The process of matching up base sequences (e.g. of genes) or amino acid sequences (of proteins) to reveal similarities and differences between them. It enables researchers to compare, for example, a newly sequenced gene or protein fragment with well-characterized sequences, and is a key step in identifying the nature, possible function, and evolutionary relationships of novel genes and proteins. Alignment is performed by any of various computer programs and makes use of the vast amount of sequence data stored in public databases, which can be accessed via the Internet. In a pairwise alignment just two sequences are compared, whereas a multiple sequence alignment compares three or more. The program compares the sequences and computes the best alignment(s), allowing for gaps and mismatches. There are two main types of sequence alignment: global and local. Global alignment extends over the entire length of the sequences being compared (of a gene, genomic region, or protein) and tends to be used to highlight mismatches in otherwise similar sequences, such as those of homologous proteins. This can provide information about the evolutionary relationships of the organisms from which the sequences were obtained. Local alignment extends over relatively short stretches of sequence and pinpoints regions in which a novel gene or protein sequence (the query sequence) is similar to sequences of genes or proteins whose structure and function are well described. This can provide clues about structural features of the novel protein, such as DNA-binding or protein-binding domains, and hence about its possible function. It also helps to identify similarities due to homology with proteins of other organisms. The distinction between local and global alignment strategies is to some extent arbitrary, and the two strategies are complementary.
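As a rough illustration of the difference between global and local pairwise alignment, the following Python sketch computes the best alignment score under each strategy by dynamic programming. It uses the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, which this entry does not itself name. The function names and the scores for a match (+1), a mismatch (-1), and a gap (-2) are assumptions chosen purely for illustration; real alignment programs use more elaborate scoring and also report the alignment itself, not just its score.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning ALL of a with ALL of b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not standard defaults.
    """
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over any pair of substrings of a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets a local alignment start afresh anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("GATTACA", "GATTACA"))      # 7: seven matches
    print(global_score("GATTACA", "GATACA"))       # 4: six matches, one gap
    print(local_score("AAAGATTACA", "GATTACATTT"))  # 7: shared GATTACA region
```

In the last call the two sequences share a seven-base region but differ in their flanking bases, so the local score picks out the shared region, whereas the global score for the same pair is lower because the flanking bases must also be aligned or gapped.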
The degree of similarity of aligned sequences is assessed quantitatively by giving each match, mismatch, or gap a score. For example, positive scores may be given for matching bases at a given position, whereas mismatches or gaps are penalized with a negative score. Scoring of protein sequence alignments is complicated by the differences in physical and chemical properties of the constituent amino acids. A mismatch involving substitution by an amino acid with similar properties (e.g. arginine for lysine) is unlikely to affect the functional properties of the protein and so is penalized very lightly, whereas one in which the mismatched amino acid has radically different properties, and hence is potentially of functional significance for the protein (e.g. alanine for tryptophan), is penalized heavily. The score for every possible pairing of amino acids is listed in a scoring matrix (also called a substitution matrix) designed for use with the alignment tool. The overall score provides a means of assessing the biological significance of the alignment. If there is a very low probability that such an alignment score could be obtained purely by chance, then the two sequences are likely to share some meaningful property, such as a functional protein domain or sequence homology derived from a common ancestor. See also BLAST.
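The scoring scheme described above can be sketched in Python as follows. This is a minimal illustration, assuming the two sequences have already been aligned and that gaps are written as "-". The function name, the toy scoring matrix, and every numerical value are hypothetical: they are not taken from any published matrix and serve only to show a light penalty for a conservative substitution (lysine/arginine, K/R) and a heavy penalty for a radical one (alanine/tryptophan, A/W).

```python
# Hypothetical scores, for illustration only (not from any published matrix).
MATCH = 2               # identical residues at a position
DEFAULT_MISMATCH = -2   # any substitution not listed in TOY_MATRIX
GAP = -3                # a residue aligned with a gap ("-")

# Toy scoring matrix keyed by unordered residue pairs (one-letter codes).
TOY_MATRIX = {
    frozenset("KR"): -1,  # lysine/arginine: similar properties, light penalty
    frozenset("AW"): -4,  # alanine/tryptophan: very different, heavy penalty
}


def score_alignment(row_a, row_b):
    """Sum the per-position scores of two already-aligned sequences."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += MATCH
        else:
            total += TOY_MATRIX.get(frozenset((x, y)), DEFAULT_MISMATCH)
    return total


if __name__ == "__main__":
    print(score_alignment("MKA", "MRA"))    # 3: 2 + (-1) + 2
    print(score_alignment("MAA", "MWA"))    # 0: 2 + (-4) + 2
    print(score_alignment("MK-A", "MKLA"))  # 3: 2 + 2 + (-3) + 2
```

Assessing whether a given total could have arisen by chance requires a statistical model of random sequence scores, which this sketch does not attempt.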