Difference between revisions of "HMM"
Revision as of 16:47, 6 October 2015
We use HMMs to detect phosphatase domains and accessory domains. We draw on the HMMs in public databases such as Pfam, and on the sequence alignments in public databases such as CDD, SMART and COG, which are useful for building HMMs. We also build HMMs from scratch.
HMMs for Determining the Boundaries of Protein Phosphatase Domains
- AP_AP: in-house HMM to find and determine the boundary of alkaline phosphatases
- CC1_DSP: in-house HMM to find and determine the boundary of DSPs
- CC1_Myotubularin: in-house HMM to find and determine the boundary of myotubularins
- CC1_OCA: in-house HMM to find and determine the boundary of OCAs
- CC1_PTEN: in-house HMM to find and determine the boundary of PTENs
- CC1_PTP: in-house HMM to find and determine the boundary of PTPs
- CC1_Paladin: in-house HMM to find and determine the boundary of Paladins
- CC1_Sac: in-house HMM to find and determine the boundary of SACs
- CC2_LMWPTP: in-house HMM to find and determine the boundary of LMWPTPs
- CC2_SSU72: in-house HMM to find and determine the boundary of SSU72s
- HAD_EYA: in-house HMM to find and determine the boundary of EYAs
- HAD_FCP: in-house HMM to find and determine the boundary of FCPs
- HAD_NagD: in-house HMM to find and determine the boundary of NagDs
- HP_HP1: in-house HMM to find and determine the boundary of HP1s
- HP_HP2: in-house HMM to find and determine the boundary of HP2s
- PHP_PHP: in-house HMM to find and determine the boundary of PHPs
- PPM_PPM: in-house HMM to find and determine the boundary of PPMs
- PPPL_PAP: in-house HMM to find and determine the boundary of PAPs
- PPPL_PPPc: in-house HMM to find and determine the boundary of PPPc
- RTR1_RTR1: in-house HMM to find and determine the boundary of RTR1s
- Rhodanese_CDC25: in-house HMM to find and determine the boundary of CDC25s
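As an illustration of how a boundary can be read off a search result, the sketch below parses the per-domain table that HMMER's hmmsearch writes with --domtblout and reports the envelope coordinates of each hit. It assumes the profiles above are searched with hmmsearch and that the envelope (rather than the alignment) coordinates are taken as the boundary; both are assumptions, and the names DomainHit and parse_domtblout are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class DomainHit(NamedTuple):
    target: str      # name of the sequence that was searched
    hmm: str         # name of the HMM profile used as the query
    i_evalue: float  # independent E-value of this domain
    env_from: int    # envelope start on the sequence (1-based)
    env_to: int      # envelope end on the sequence (inclusive)


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data row of an hmmsearch --domtblout file."""
    for line in lines:
        # Skip blank lines and the '#' header/footer lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete data row
        yield DomainHit(
            target=fields[0],
            hmm=fields[3],
            i_evalue=float(fields[12]),
            env_from=int(fields[19]),
            env_to=int(fields[20]),
        )
```

Typical use would be to open the table file and pass the file object to parse_domtblout, then keep the hits whose independent E-value is below a cutoff of your choice.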
HMMs for Finding Protein Phosphatase Domains with high coverage
We have built a set of semi-redundant HMM profiles to detect phosphatase domains in biological sequences. We first used the HMMs from the public databases Pfam, SMART and SUPERFAMILY to search the protein phosphatases we collected from the literature. We found that 1) the HMMs from public databases such as Pfam and SMART cannot find all human protein phosphatases, and 2) some HMMs are redundant: they capture exactly the same set of protein phosphatases. Thus, we 1) built in-house HMMs to capture the human protein phosphatases that were missed, and 2) removed the 100% redundant HMMs. You can download the HMMs.
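The two checks described above can be sketched in a few lines. The example assumes the search results have already been reduced to a mapping from HMM name to the set of phosphatases it hits; the function names and the rule for choosing which of several identical HMMs to keep (the alphabetically first) are hypothetical and not taken from the original procedure.

```python
from typing import Dict, List, Set


def missed_phosphatases(reference: Set[str], hits: Dict[str, Set[str]]) -> Set[str]:
    """Return the reference phosphatases that no HMM detects."""
    found: Set[str] = set()
    for hit_set in hits.values():
        found |= hit_set
    return reference - found


def drop_fully_redundant(hits: Dict[str, Set[str]]) -> List[str]:
    """Keep one HMM per distinct hit set (100% redundant HMMs are dropped)."""
    kept: Dict[frozenset, str] = {}
    for name in sorted(hits):
        # HMMs that capture exactly the same set share the same key;
        # only the first name seen for a key is retained.
        kept.setdefault(frozenset(hits[name]), name)
    return sorted(kept.values())
```

HMMs whose hit sets only partly overlap are all kept, which is why the resulting collection is semi-redundant rather than non-redundant.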
HMMs of accessory domains
- PAP_NTD: in-house HMM of Purple Acid Phosphatase, N-Terminal Domain
- CDC25_NTD: in-house HMM of CDC25, N-terminal domain
- IQ: built from SMART alignment
- PPIP5K_RimK
- STS_UBA
- MTMR5_C1: in-house HMM of MTMR5, C1 domain
- MTMR_GRAM: in-house HMM of myotubularin, GRAM domain
- PP2C_C: Pfam HMM that matches only the PPP1C subfamily and not the other PPPc subfamilies. It overlaps with our in-house PPPc HMM profile.
Guidance for HMM building
We usually build the HMMs from PSI-BLAST hits.
To find the domain sequences for building an HMM, we run PSI-BLAST with the domain sequence or the full sequence as the query, usually against the protein NR dataset via the NCBI BLAST server. It sometimes matters whether you query the region that is supposed to contain the domain (based on structure or other evidence) or the full sequence. The full sequence is often more sensitive for finding weak hits to the domain. We recommend downloading the Alignment, Search Strategies, and PssmWithParameters files of the PSI-BLAST result for reproducibility. The boundaries of the domain are determined from crystal structures, usually using the boundaries described in the papers that reported the structures.
After several rounds of PSI-BLAST, we download the sequences of the aligned regions (not the complete sequences) from the PSI-BLAST result. Because redundant sequences are not useful for building the HMM profile, we create a non-redundant sequence data set using the program CD-HIT (http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi), usually with a sequence identity threshold of 70%, i.e. with the parameter -c set to 0.7.
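For readers who run CD-HIT locally rather than through the web server, the sketch below assembles an equivalent command line. It assumes a cd-hit executable is on the PATH; the file names are placeholders, and the helper names are hypothetical. The word size (-n) is chosen to match the identity threshold, following the ranges given in the CD-HIT user guide.

```python
import subprocess
from typing import List


def cdhit_word_size(identity: float) -> int:
    """Pick the -n word size that CD-HIT recommends for a given -c threshold."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit does not support identity thresholds below 0.4")


def cdhit_command(fasta_in: str, fasta_out: str, identity: float = 0.7) -> List[str]:
    """Build the argument list for clustering at the given sequence identity."""
    return [
        "cd-hit",
        "-i", fasta_in,        # input FASTA (aligned regions from PSI-BLAST)
        "-o", fasta_out,       # output FASTA of cluster representatives
        "-c", str(identity),   # sequence identity threshold
        "-n", str(cdhit_word_size(identity)),
    ]


def run_cdhit(fasta_in: str, fasta_out: str, identity: float = 0.7) -> None:
    """Run cd-hit; raises CalledProcessError if the program fails."""
    subprocess.run(cdhit_command(fasta_in, fasta_out, identity), check=True)
```

Calling run_cdhit("hits.fasta", "hits_nr70.fasta") would write the representatives to hits_nr70.fasta and the cluster listing to hits_nr70.fasta.clstr.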
We then carry out multiple sequence alignment (MSA) using programs such as MUSCLE, and manually adjust the alignment in an MSA editor such as JalView, usually by removing low-quality regions. We further inspect the distribution of sequence lengths in the MSA and remove the sequences that are shorter than most sequences in the MSA. How we remove the short sequences depends on the distribution and the MSA itself, and varies case by case.
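One simple way to screen for short sequences is to compare each ungapped length with the median. The sketch below does this for an aligned FASTA file; the 0.8 fraction of the median is a hypothetical starting point only, since the right cutoff depends on the length distribution of each MSA, and the function names are illustrative.

```python
from statistics import median
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}, keeping gap characters."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier up to the first space
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def ungapped_length(seq: str) -> int:
    """Count residues, ignoring the gap characters '-' and '.'."""
    return sum(1 for c in seq if c not in "-.")


def drop_short(records: Dict[str, str], min_fraction: float = 0.8) -> Dict[str, str]:
    """Keep sequences whose ungapped length is at least min_fraction of the median."""
    if not records:
        return {}
    lengths = {name: ungapped_length(seq) for name, seq in records.items()}
    cutoff = min_fraction * median(lengths.values())
    return {name: seq for name, seq in records.items() if lengths[name] >= cutoff}
```

Printing the sorted lengths before choosing min_fraction is a quick substitute for inspecting the distribution by eye.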
After removing the short sequences, we rerun the MSA program and manually adjust the resulting MSA again. Then, we build the HMM using the program HMMBUILD. Depending on the format you use, you may need to convert the MSA into STOCKHOLM format before running HMMBUILD.
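The conversion and the build step can be sketched as follows. The writer produces a minimal, single-block STOCKHOLM file from aligned sequences held in a dictionary, and the second helper assembles an hmmbuild command line. The helper names and file names are hypothetical, and an hmmbuild executable from HMMER 3 is assumed to be installed.

```python
from typing import Dict, List


def to_stockholm(records: Dict[str, str]) -> str:
    """Render aligned sequences as a minimal single-block STOCKHOLM file."""
    if not records:
        raise ValueError("no sequences to write")
    if len({len(seq) for seq in records.values()}) != 1:
        raise ValueError("sequences are not aligned (lengths differ)")
    if any(" " in name or "\t" in name for name in records):
        raise ValueError("sequence names must not contain whitespace")
    pad = max(len(name) for name in records) + 2  # name column width
    lines = ["# STOCKHOLM 1.0", ""]
    for name, seq in records.items():
        lines.append(f"{name.ljust(pad)}{seq}")
    lines.append("//")  # end-of-alignment marker
    return "\n".join(lines) + "\n"


def hmmbuild_command(hmm_out: str, msa_path: str, name: str) -> List[str]:
    """Build the argument list: hmmbuild -n <name> <hmmfile_out> <msafile>."""
    return ["hmmbuild", "-n", name, hmm_out, msa_path]
```

Writing the returned text to a file such as PTP.sto and then running the command from hmmbuild_command("PTP.hmm", "PTP.sto", "CC1_PTP") would produce a profile named CC1_PTP.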