HMM
Latest revision as of 18:26, 15 March 2017
We use HMMs to detect phosphatase domains and accessory domains. We benefit from the HMMs in public databases such as Pfam, and from the sequence alignments in public databases such as CDD, SMART and COG, which are useful for building HMMs. We also build HMMs from scratch.
HMMs for Determining the Boundaries of Protein Phosphatase Domain
- CC1 fold
  - DSP
  - Myotubularin
  - OCA
  - PTEN
  - PTP
  - Paladin
  - Sac
- CC2 fold
  - LMWPTP
  - SSU72
- CC3/Rhodanese/CDC25
- HAD fold
  - EYA
  - FCP
  - NagD
- HP fold
  - HP1
  - HP2
- PPP fold
  - PAP
  - PPP
- AP fold
- PHP fold
- PPM fold
- RTR1 fold
HMMs for Finding Protein Phosphatase Domain with High Coverage
We have built a semi-redundant set of HMM profiles to detect phosphatase domains in biological sequences. We first used the HMMs from the public databases Pfam, SMART and SUPERFAMILY to search the protein phosphatases we collected from the literature. We found that 1) the HMMs from public databases such as Pfam and SMART cannot find all human protein phosphatases, and 2) some HMMs are redundant: they capture exactly the same set of protein phosphatases. Thus, we 1) built in-house HMMs to capture the human protein phosphatases that were missed, and 2) removed the 100% redundant HMMs. You can download the HMMs from http://phosphatome.net/download/.
- Vertebrate PTP: built from the alignment of 195 vertebrate PTP sequences at ptp.cshl (http://ptp.cshl.edu/downloads.shtml).
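The removal of 100% redundant HMMs described above can be illustrated with a minimal Python sketch. It assumes the search results are already available as a mapping from profile name to the set of phosphatases each profile hits. The function name, the profile names and the protein identifiers are hypothetical, and the rule for which profile to keep within a redundant group (the alphabetically first) is an assumption, not something stated on this page.

```python
from collections import defaultdict


def drop_fully_redundant(hits_by_hmm):
    """Keep one HMM per group of HMMs that hit exactly the same proteins."""
    groups = defaultdict(list)
    # Iterate in sorted order so the profile kept from each group is deterministic.
    for name in sorted(hits_by_hmm):
        groups[frozenset(hits_by_hmm[name])].append(name)
    # Assumption: keep the alphabetically first profile of each group.
    return sorted(names[0] for names in groups.values())


# Hypothetical search results: profile name -> set of phosphatases it detects.
hits_by_hmm = {
    "profile_A": {"P1", "P2", "P3"},
    "profile_B": {"P1", "P2", "P3"},  # same hit set as profile_A, so 100% redundant
    "profile_C": {"P2", "P4"},        # overlaps with the others, but not identical
}

kept = drop_fully_redundant(hits_by_hmm)
print(kept)
```

Profiles whose hit sets only partially overlap are all kept, which is consistent with the resulting collection being semi-redundant rather than non-redundant.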
HMMs of accessory domains
- PH domain
  - MTMR_GRAM: myotubularin, GRAM domain (in-house)
    - MTMR_1_GRAM: GRAM domain profile of MTMR1, MTMR5, MTMR6, MTMR9 and MTMR10 subfamilies (in-house)
    - MTMR3_GRAM: MTMR3, GRAM domain (in-house)
    - MTMR9_GRAM: MTMR9, GRAM domain (built from CDD alignment)
    - MTMR14_GRAM: MTMR14, GRAM domain (in-house)
  - PH_1: PH domain (built from CDD alignment)
- FCP1 C-Terminal Domain: FCP1 has a C-terminal domain which is conserved within individual clades such as vertebrates and arthropods. However, it is hard to build a universal profile that captures all FCP1 CTDs. The best guess is that the region contains several short motifs.
  - FCP1_CTD: FCP1, C-Terminal Domain (in-house). The profile aims to detect the presence of the FCP1_C domain but not its boundaries. The profile is able to detect the FCP1 CTD in Eumetazoa, such as Nematostella, fruit fly, some but not all nematodes (e.g. Loa loa), sea urchin, Ciona and human.
- VSP_VSD: VSP, Voltage Sensor Domain (in-house)
- PAP_NTD: Purple Acid Phosphatase, N-Terminal Domain (in-house)
- CDC25_NTD: CDC25, N-terminal domain (in-house)
- IQ: IQ profile built from SMART alignment
- MTMR5_C1: MTMR5, C1 domain (in-house)
- SacN: Sac N-terminal domain (in-house)
- IPPc: Inositol Polyphosphate Phosphatase, catalytic domain homologues (built from SMART alignment)
- SAC9_CTD1: SAC9 C-terminal domain 1. The domain is mostly found in plants, green algae, and Amoebozoa. (in-house)
- SAC9_CTD2: SAC9 C-terminal domain 2. The domain is mostly found in plants, green algae, and Amoebozoa. (in-house)
- WW (built from SMART alignment)
- SSH_NTD: Slingshot, N-terminal domain (built from CDD alignment)
- DnaJ_1: DnaJ domain (built from SMART alignment)
- RA: Ras Association (RA)
HMMs that partially overlap with phosphatase domains
Some domains and their HMM profiles partially overlap with phosphatase domains. We do not consider them accessory domains.
- PP2C_C (http://pfam.xfam.org/family/PF07830.9): The Pfam PP2C_C profile matches only the PPP1C subfamily and no other PPP subfamilies. It overlaps with our in-house PPP HMM profile.
- 3-PAP (http://pfam.xfam.org/family/3-PAP#tabview=tab1): The Pfam 3-PAP domain profile is of poor quality. The Pfam description states that "this domain family is found in eukaryotes, and is typically between 115 and 138 amino acids in length", but the profile has a length of 132 aa in the HMM logo. The profile partially overlaps with our profile of the myotubularin phosphatase domain, whose boundaries are determined from the crystal structure.
Guidance for HMM building
We usually build the HMMs from PSI-BLAST hits.
To find the domain sequences for building an HMM, we PSI-BLAST the domain sequence or the full sequence, usually against the protein NR, RefSeq or Swiss-Prot data set via the NCBI BLAST server. It sometimes matters whether you query the region that is supposed to contain the domain (based upon structure or other evidence) or the full sequence. Querying with the full sequence is often more sensitive for finding weak hits to the domain. We recommend downloading the Alignment, Search Strategies, and PssmWithParameters files of the PSI-BLAST result for reproducibility. The boundaries of the domain are determined from crystal structures, usually using the boundaries described in the papers that reported the structures.
After several rounds of PSI-BLAST, we download the sequences of the aligned regions (not the complete sequences) from the PSI-BLAST result. Because redundant sequences are not useful for building the HMM profile, we create a non-redundant sequence data set using the program CD-HIT (http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi), usually with a sequence identity threshold of 70%, i.e. the parameter -c set to 0.7.
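For readers who prefer a local installation over the web server, the clustering step might be scripted roughly as follows. This is a sketch under assumptions: it presumes the command-line `cd-hit` program is installed, the file names are hypothetical, and the word size `-n 5` is the value the CD-HIT user guide pairs with identity thresholds of 0.7 and above. The page itself only states the `-c 0.7` setting.

```python
import subprocess


def cd_hit_command(in_fasta, out_fasta, identity=0.7):
    """Build a cd-hit command line for a given sequence identity threshold."""
    if not 0.7 <= identity <= 1.0:
        # Lower thresholds need a smaller word size (-n); not covered by this sketch.
        raise ValueError("this sketch only covers identity thresholds from 0.7 to 1.0")
    return ["cd-hit", "-i", in_fasta, "-o", out_fasta, "-c", str(identity), "-n", "5"]


def run_cd_hit(in_fasta, out_fasta, identity=0.7):
    """Run cd-hit; requires the cd-hit executable to be on the PATH."""
    subprocess.run(cd_hit_command(in_fasta, out_fasta, identity), check=True)


# Hypothetical file names: aligned regions exported from PSI-BLAST, clustered at 70%.
cmd = cd_hit_command("psiblast_aligned_regions.fasta", "domain_nr70.fasta")
print(" ".join(cmd))
```

Only `cd_hit_command` is called here, so the example runs without CD-HIT being installed; `run_cd_hit` would perform the actual clustering.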
We then carry out multiple sequence alignment (MSA) using programs such as MUSCLE and manually adjust the alignment, usually by removing low-quality regions in an MSA editor such as Jalview. We further inspect the distribution of sequence lengths in the MSA and remove the sequences that are shorter than most sequences in the MSA. How we remove the short sequences depends on the distribution and the MSA itself, and varies case by case.
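As a rough illustration of the length inspection, the sketch below counts the ungapped residues of each aligned sequence and drops those that fall below a fraction of the median length. The cutoff of 80% of the median is a hypothetical choice made only for this example; as noted above, the actual decision is made case by case after looking at the distribution. The sequence names and sequences are made up.

```python
from statistics import median


def ungapped_length(aligned_seq):
    """Number of residues in an aligned sequence, ignoring '-' and '.' gap characters."""
    return len(aligned_seq.replace("-", "").replace(".", ""))


def drop_short_sequences(msa, min_fraction=0.8):
    """Drop sequences whose ungapped length is below min_fraction of the median length."""
    lengths = {name: ungapped_length(seq) for name, seq in msa.items()}
    cutoff = min_fraction * median(lengths.values())
    return {name: seq for name, seq in msa.items() if lengths[name] >= cutoff}


# Hypothetical toy alignment: three near full-length sequences and one fragment.
msa = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MKTA-IAKQR",
    "seq3": "MKSAYLAKQR",
    "frag": "----YIA---",
}

for name, seq in msa.items():
    print(name, ungapped_length(seq))

kept = drop_short_sequences(msa)
print(list(kept))
```

Printing the lengths first mirrors the manual step of inspecting the distribution before choosing a cutoff.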
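After removing the short sequences, we rerun the MSA program and manually adjust the resulting MSA again. Then, we build the HMM using the program HMMBUILD (http://hmmer.janelia.org). Depending on the format you use, you may need to convert the MSA into STOCKHOLM format (http://sequenceconversion.bugaco.com/converter/biology/sequences/fasta_to_stockholm.php) before running HMMBUILD.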
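If an online converter is not convenient, the conversion from aligned FASTA to STOCKHOLM format can be sketched in a few lines of Python. This is a minimal, assumption-laden version: it writes a single-block Stockholm file with no annotation lines, uses the first word of each FASTA header as the sequence name, and assumes those names are unique. The function names and example sequences are hypothetical.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into a list of (name, aligned_sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_stockholm(records):
    """Format aligned records as a single-block Stockholm 1.0 alignment."""
    if len({len(seq) for _, seq in records}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name, _ in records)
    lines = ["# STOCKHOLM 1.0"]
    lines += [f"{name.ljust(width)} {seq}" for name, seq in records]
    lines.append("//")
    return "\n".join(lines) + "\n"


# Hypothetical two-sequence alignment wrapped over two lines per record.
fasta_text = ">seq1 human\nMKT-AY\nIAK\n>seq2\nMKTQAY\nLAK\n"
stockholm_text = to_stockholm(read_aligned_fasta(fasta_text))
print(stockholm_text, end="")
```

The resulting text can be saved to a file and passed to HMMBUILD, whose general usage is `hmmbuild <hmmfile_out> <msafile>`.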
Note: The conserved regions (determined by sequence similarity) could be longer or shorter than the domains observed in crystal structures.