To get started, from the WebSTR homepage, pick a genome assembly version in the dropdown. WebSTR currently stores some datasets that are only mapped to the hg19/GRCh37 coordinates. The default value is hg38 and corresponds to Ensembl version GRCh38.p2. You can search for STRs by entering one of the following into the search bar:
A valid search will take you to a region-level page. The top of this page displays the exon/intron structure of genes in the region. Arrows next to the gene names represent the template strand of the gene. Each dot represents an STR in the region, color-coded by the motif length (gray=homopolymer, red=dinucleotide, gold=trinucleotide, blue=tetranucleotide, green=pentanucleotide, and purple=hexanucleotide). Hovering over an STR will display the coordinates and motif. Clicking an STR will take you to the STR page for that locus.
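For readers who want to reproduce the color legend in their own plots, the mapping above can be sketched as a small lookup. The names `MOTIF_COLORS` and `motif_color` are illustrative and not part of WebSTR; only the length-to-color pairs come from the description above.

```python
# Legend copied from the region-page description; names here are illustrative only.
MOTIF_COLORS = {
    1: "gray",    # homopolymer
    2: "red",     # dinucleotide
    3: "gold",    # trinucleotide
    4: "blue",    # tetranucleotide
    5: "green",   # pentanucleotide
    6: "purple",  # hexanucleotide
}


def motif_color(motif: str) -> str:
    """Return the legend color for a repeat motif such as "AC" or "CAG"."""
    length = len(motif)
    if length not in MOTIF_COLORS:
        raise ValueError(f"no legend color for motif length {length}")
    return MOTIF_COLORS[length]
```

For example, `motif_color("CAG")` looks up length 3 and returns `"gold"`.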
The bottom of the page displays a table of all STRs identified in the region. It includes the coordinates, motif sequence, and length of the repeat in the reference genome.
STR pages (e.g. see example) provide locus-level information gathered from various genome-wide studies. By default, the sequence of the STR (red) and its genomic context (black, +/- 120bp) are shown. Other data panels (Expression STRs, Mutation and constraint, and STR imputation) can be displayed or hidden by clicking on the respective black title boxes. These panels are described below.
STR pages feature datasets from the following studies:
This STR reference panel is based on the GRCh38 reference assembly and contains 1.7 million unique autosomal STRs based on a combined set of TRs genotyped by four separate methods (HipSTR, GangSTR, ExpansionHunter, and AdVNTR) on the 1000 Genomes Project and H3Africa data.
For this reference panel we used the statistical framework TRAL to find STRs in the human reference genome. The Sinergia-CRC repeats have been genotyped using GangSTR on more than 400 genomes from patients with colorectal cancer, available to us through the TCGA consortium. This project is part of a larger effort, "Trans-omic approach to colorectal cancer: an integrative computational and clinical perspective", funded by an SNSF Sinergia grant.
In this study we analyzed STRs in whole-genome sequencing data from the Genotype-Tissue Expression (GTEx) Project for 650 individuals, together with gene expression across 17 tissues, to detect STRs whose lengths are correlated with expression of nearby genes (termed "eSTRs"). We further used CAVIAR to fine-map associations for individual genes against nearby SNPs to identify eSTRs most likely acting as causal variants. The figure below shows a schematic of the study design.
For each STR, all significant eSTRs (per-tissue gene-level FDR of 10%) are shown. The following statistics are given for each eSTR association:
In this study we formulated a novel STR mutation model based on a mean-centered random walk and used this model to estimate key parameters of STR mutation at individual loci. We then used this model to compute per-STR estimates of mutational constraint by comparing observed to expected mutation rates at each STR. These constraint metrics can be used to prioritize potentially pathogenic variants. For example, a mutation at a highly constrained STR (one whose observed mutation rate is much lower than expected) may be pathogenic.
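The observed-versus-expected comparison can be illustrated with a minimal sketch. It assumes the constraint score takes the form of a Z-score on the log10 mutation-rate scale. That form, the function name `constraint_z`, and the example numbers are assumptions for illustration and are not taken from this page or from the WebSTR data.

```python
import math


def constraint_z(obs_log10_mu: float, exp_log10_mu: float,
                 obs_se: float, exp_se: float) -> float:
    """Z-score of observed minus expected log10 mutation rate.

    Assumed form: the difference is scaled by the combined standard error
    of the two estimates. Negative values mean the locus mutates more
    slowly than expected, i.e. it looks constrained.
    """
    combined_se = math.sqrt(obs_se ** 2 + exp_se ** 2)
    if combined_se == 0:
        raise ValueError("standard errors must not both be zero")
    return (obs_log10_mu - exp_log10_mu) / combined_se


# Hypothetical locus: observed rate 1e-5, expected rate 1e-4 per generation.
z = constraint_z(obs_log10_mu=-5.0, exp_log10_mu=-4.0, obs_se=0.3, exp_se=0.4)
```

With these made-up inputs the score comes out at about -2, which under this assumed definition would flag the locus as more constrained than expected.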
The following data are shown for each STR. Note that constraint scores were only computed for STRs with motif lengths 2 and 4, and that no mutation information is available for loci where optimization of our likelihood model failed.
We generated a reference haplotype panel for imputing STR genotypes into SNP genotypes (either from WGS or from SNP arrays). The panel is based on WGS for quad families from the Simons Simplex Collection, which consists of individuals from various ancestry groups but is majority European and thus will have better performance for similar ancestry cohorts.
When available, the imputation metrics described below are shown for each STR. These metrics give an overall picture of how well imputation of each STR will work across various ancestry groups.
Locus-level imputation metrics:
At each STR, we evaluated imputation performance by comparing genotypes obtained by HipSTR directly from WGS data with those obtained by imputation using our panel. The statistics shown are based on two types of evaluation. First, we performed a within-sample leave-one-out analysis (labeled as SSC). Second, we compared imputed and HipSTR genotypes in a set of samples independent of those used to generate the panel (1000 Genomes European, African, and East Asian cohorts). For each analysis, the following statistics are given:
We provide programmatic access to the data through a RESTful API. Documentation on the available endpoints, with code examples, is available on the API main page: http://webstr-api.ucsd.edu/docs
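As a starting point for exploring the API from a script, the sketch below lists the endpoints described by an OpenAPI schema. It assumes the documentation page is generated from an OpenAPI schema served at `/openapi.json` on the same host. That path is an assumption and is not stated on this page, and the helper names `fetch_schema` and `list_endpoints` are hypothetical. No specific data endpoint is assumed; consult the documentation link above for the real ones.

```python
import json
from urllib.request import urlopen

API_ROOT = "http://webstr-api.ucsd.edu"  # host taken from the documentation URL
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (HTTP method, path) pairs from an OpenAPI schema dict."""
    pairs = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_schema(url: str = f"{API_ROOT}/openapi.json",
                 timeout: float = 10.0) -> dict:
    """Download and parse the schema (assumed location; adjust if needed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `list_endpoints(fetch_schema())` would then return the available method and path pairs, provided the schema is actually served at the assumed location.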
The first version of WebSTR was made by Richard Yanicky and Melissa Gymrek with input from other Gymrek Lab members. It was originally inspired by the Exome Aggregation Database (ExAC). The current version of the website, database, and API is developed in collaboration with Maria Anisimova's Lab.
This collaboration project was supported by the SNSF Sinergia grant CRSII5_193832 and the EU Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 823886. Development of the first version of WebSTR was supported in part by the Office Of The Director, National Institutes of Health under Award Number DP5OD024577 and by SFARI Explorer Award Number 515568. Hosting, maintenance and development of WebSTR is partially funded by the NIH/NHGRI grant R01HG010885.