Manual of the HTP database

Preparation of benchmark sets

The TOPDB database was split into two parts: the first contains entries with a known 3D structure, while the second contains entries whose topology is confirmed only by molecular biology experiments. Entries were selected if their reliability was above 99% for bitopic and above 95% for polytopic transmembrane proteins. Each sequence in the human proteome was searched against these two sets with BLAST. The resulting hits were aligned with the query sequences using HSPs (high-scoring segment pairs), and a hit was kept only if

  • its sequence similarity to the query was above 40%,
  • the overlapping sequences covered all TM helices of the TOPDB entry, and
  • the length of the hit sequence was above 80% of the length of the query sequence.
Finally, both sets were filtered with the CD-HIT algorithm at 40% similarity. This resulted in 134 sequences whose homologous partner has a known structure ("3D benchmark set") and 333 sequences whose homologous partner has only experimental topology data ("experimental benchmark set").
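
The three criteria for keeping a BLAST hit can be illustrated with a minimal Python sketch. It is not the code used to build the benchmark sets. The Hit record, its field names and the passes_filters function are hypothetical. The sketch further assumes that a TM helix counts as "covered" when it lies entirely within a single HSP range on the TOPDB sequence, and that the similarity and the sequence lengths have already been parsed from the BLAST output.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[int, int]  # 1-based, inclusive (start, end) on the TOPDB sequence


@dataclass
class Hit:
    """Hypothetical record describing one BLAST hit against a TOPDB entry."""
    similarity_pct: float            # sequence similarity of the alignment, in percent
    hsp_subject_ranges: List[Range]  # regions of the TOPDB entry covered by HSPs
    tm_helices: List[Range]          # TM helix positions annotated in the TOPDB entry
    hit_length: int                  # length of the hit (TOPDB) sequence
    query_length: int                # length of the query (human) sequence


def helix_covered(helix: Range, ranges: List[Range]) -> bool:
    """Assumption: a helix is covered if it lies fully inside one HSP range."""
    start, end = helix
    return any(s <= start and end <= e for s, e in ranges)


def passes_filters(hit: Hit,
                   min_similarity: float = 40.0,
                   min_length_ratio: float = 0.8) -> bool:
    """Apply the three criteria listed above; thresholds are strict ("above")."""
    if hit.similarity_pct <= min_similarity:
        return False
    if not all(helix_covered(h, hit.hsp_subject_ranges) for h in hit.tm_helices):
        return False
    return hit.hit_length > min_length_ratio * hit.query_length
```

A hit with 55% similarity, HSP coverage of positions 10 to 200, helices at 20 to 40 and 100 to 120, and lengths of 250 (hit) and 300 (query) would pass all three checks under these assumptions. Other definitions of coverage, such as the union of several HSPs, are equally plausible and would only require changing helix_covered.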