HMM

We use HMMs to detect phosphatase domains and accessory domains. We draw on HMMs from public databases such as Pfam, and on sequence alignments from public databases such as CDD, SMART and COG, which are useful for building HMMs. We also build HMMs from scratch.

List of HMMs of phosphatase domains

AP_AP: in-house HMM to find and determine the boundary of alkaline phosphatases

CC1_DSP: in-house HMM to find and determine the boundary of DSPs

CC1_Myotubularin: in-house HMM to find and determine the boundary of myotubularins

CC1_OCA: in-house HMM to find and determine the boundary of OCAs

CC1_PTEN: in-house HMM to find and determine the boundary of PTENs

CC1_PTP: in-house HMM to find and determine the boundary of PTPs

CC1_Paladin: in-house HMM to find and determine the boundary of Paladins

CC1_Sac: in-house HMM to find and determine the boundary of SACs

CC2_LMWPTP: in-house HMM to find and determine the boundary of LMWPTPs

CC2_SSU72: in-house HMM to find and determine the boundary of SSU72s

HAD_EYA: in-house HMM to find and determine the boundary of EYAs

HAD_FCP: in-house HMM to find and determine the boundary of FCPs

HAD_NagD: in-house HMM to find and determine the boundary of NagDs

HP_HP1: in-house HMM to find and determine the boundary of HP1s

HP_HP2: in-house HMM to find and determine the boundary of HP2s

PHP_PHP: in-house HMM to find and determine the boundary of PHPs

PPM_PPM: in-house HMM to find and determine the boundary of PPMs

PPPL_PAP: in-house HMM to find and determine the boundary of PAPs

PPPL_PPPc: in-house HMM to find and determine the boundary of PPPc

RTR1_RTR1: in-house HMM to find and determine the boundary of RTR1s

Rhodanese_CDC25: in-house HMM to find and determine the boundary of CDC25s

List of HMMs of accessory domains

PAP_NTD: in-house HMM of Purple Acid Phosphatase, N-Terminal Domain

CDC25_NTD: in-house HMM of CDC25, N-terminal domain

IQ: built from SMART alignment

PPIP5K_RimK

STS_UBA

MTMR5_C1: in-house HMM of MTMR5, C1 domain

MTMR_GRAM: in-house HMM of myotubularin, GRAM domain

PP2C_C: Pfam HMM that specifically matches PPP1C, but not other PPP subfamilies

The Pfam PP2C_C profile matches only the PPP1C subfamily, not other PPPc subfamilies. It overlaps with our in-house PPPc HMM profile.

Guidance for HMM building

We usually build HMMs from PSI-BLAST hits.

To find the domain sequences for building an HMM, we run PSI-BLAST with either the domain sequence or the full sequence as the query, usually against the NCBI non-redundant (nr) protein database via the NCBI BLAST server. It sometimes matters whether you query with the region that is supposed to contain the domain (based on structure or other evidence) or with the full sequence. The full sequence is often more sensitive for finding weak hits to the domain. We recommend downloading the Alignment, Search Strategies, and PssmWithParameters files of the PSI-BLAST result for reproducibility.

After several rounds of PSI-BLAST, we download the sequences of the aligned regions (not the complete sequences) from the PSI-BLAST result. Because redundant sequences add little to an HMM profile, we create a non-redundant sequence set with the program CD-HIT (usually with a sequence identity threshold of 70%, i.e. the parameter -c set to 0.7).
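The redundancy-removal step can be sketched as a small Python helper that assembles the CD-HIT command line. This is a minimal sketch, not the exact command we ran: the file names are hypothetical, and the word size (-n) is chosen from the ranges suggested in the CD-HIT user guide, which the text above does not specify.

```python
def cd_hit_command(in_fasta, out_fasta, identity=0.7):
    """Build a cd-hit command line that clusters sequences at `identity`.

    The file names are placeholders; only -c 0.7 comes from the text above.
    """
    if not 0.4 <= identity <= 1.0:
        raise ValueError("cd-hit expects an identity threshold between 0.4 and 1.0")
    # Word size suggested by the CD-HIT user guide for each identity range
    # (an assumption here, since the text only fixes -c).
    if identity >= 0.7:
        word_size = 5
    elif identity >= 0.6:
        word_size = 4
    elif identity >= 0.5:
        word_size = 3
    else:
        word_size = 2
    return [
        "cd-hit",
        "-i", in_fasta,        # input: aligned regions downloaded from PSI-BLAST
        "-o", out_fasta,       # output: representative (non-redundant) sequences
        "-c", str(identity),   # sequence identity threshold
        "-n", str(word_size),  # word length matching the threshold
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed as a shell-ready command.
    print(" ".join(cd_hit_command("psiblast_hits.fasta", "psiblast_hits_nr70.fasta")))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where cd-hit is installed.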

We then carry out multiple sequence alignment (MSA) with a program such as MUSCLE and manually adjust the alignment in an MSA editor such as Jalview, usually by removing low-quality regions. We further inspect the distribution of sequence lengths in the MSA and remove the sequences that are shorter than most of the others. How we remove the short sequences depends on the distribution and on the MSA itself, so it varies case by case.
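As an illustration of the length inspection, the sketch below reads an aligned FASTA file, computes the ungapped length of each sequence, and drops those well below the median. The cutoff (80% of the median) and the function names are assumptions made for the example only; in practice the cutoff is chosen by looking at the distribution for each alignment.

```python
from statistics import median


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def ungapped_length(seq):
    """Count residues, ignoring the gap characters '-' and '.'."""
    return sum(1 for ch in seq if ch not in "-.")


def drop_short_sequences(records, min_fraction=0.8):
    """Keep sequences whose ungapped length is at least
    min_fraction * median length (hypothetical cutoff; tune per alignment)."""
    if not records:
        return []
    lengths = [ungapped_length(seq) for _, seq in records]
    cutoff = min_fraction * median(lengths)
    return [rec for rec, n in zip(records, lengths) if n >= cutoff]


if __name__ == "__main__":
    toy = ">a\nACDEFGHIKL\n>b\nACDEFGHIK-\n>c\nAC--------\n"
    for name, seq in drop_short_sequences(read_fasta(toy)):
        print(name, ungapped_length(seq))
```

Printing the sorted lengths before choosing min_fraction is a simple way to inspect the distribution first.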

After removing the short sequences, we rerun the MSA program and manually adjust the resulting MSA again. We then build the HMM with the program HMMBUILD. Depending on the format you use, you may need to convert the MSA into Stockholm format before running HMMBUILD.
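The conversion and build step might look like the following sketch, which writes an alignment as a minimal Stockholm file and assembles an hmmbuild command line. The function names and file names are hypothetical, and the Stockholm output carries no annotation lines; hmmbuild's -n option is used to name the profile.

```python
def to_stockholm(records):
    """Format aligned (name, sequence) pairs as a minimal Stockholm 1.0 file."""
    if not records:
        raise ValueError("alignment is empty")
    width = len(records[0][1])
    seen = set()
    for name, seq in records:
        if len(seq) != width:
            raise ValueError(f"{name}: aligned sequences must have equal length")
        if not name or any(ch.isspace() for ch in name):
            raise ValueError("sequence names must be non-empty without whitespace")
        if name in seen:
            raise ValueError(f"duplicate sequence name: {name}")
        seen.add(name)
    pad = max(len(name) for name, _ in records)
    lines = ["# STOCKHOLM 1.0", ""]
    lines += [f"{name.ljust(pad)} {seq}" for name, seq in records]
    lines.append("//")  # end-of-alignment marker
    return "\n".join(lines) + "\n"


def hmmbuild_command(hmm_out, msa_file, name=None):
    """Build an hmmbuild command line: hmmbuild [-n name] <hmmfile_out> <msafile>."""
    cmd = ["hmmbuild"]
    if name:
        cmd += ["-n", name]
    return cmd + [hmm_out, msa_file]


if __name__ == "__main__":
    # Toy alignment and hypothetical file names.
    print(to_stockholm([("seq1", "AC-DE"), ("seq2", "ACGDE")]), end="")
    print(" ".join(hmmbuild_command("CC1_PTP.hmm", "CC1_PTP.sto", name="CC1_PTP")))
```

The conversion is only needed when the MSA editor cannot export Stockholm directly.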
