Statistical analysis (currently given the misnomer artificial intelligence) of full genome sequencing reveals Autism is caused by changes in non-coding DNA (often referred to as 'junk DNA') - princeton.edu
Most previous research on the genetic basis of disease has focused on the 20,000 known genes and the surrounding sections of DNA that regulate those genes. However, even this enormous amount of genetic information makes up only slightly more than 1% of the 3.2 billion chemical pairs in the human genome. The other 99% has conventionally been thought of as “dark” or “junk,” although recent research has begun to disrupt that idea.
In their new finding, the research team offers a method to make sense of this vast array of genomic data. The system uses an artificial intelligence technique called deep learning in which an algorithm performs successive layers of analysis to learn about patterns that would otherwise be impossible to discern. In this case, the algorithm teaches itself how to identify biologically relevant sections of DNA and predicts whether those snippets play a role in any of more than 2,000 protein interactions that are known to affect the regulation of genes. The system also predicts whether disrupting a single pair of DNA units would have a substantial effect on those protein interactions.
The algorithm “slides along the genome” analyzing every single chemical pair in the context of the 1,000 chemical pairs around it, until it has scanned all mutations, Troyanskaya said. The system can thus predict the effect of mutating each and every chemical unit in the entire genome. In the end, it reveals a prioritized list of DNA sequences that are likely to regulate genes and mutations that are likely to interfere with that regulation.
Prior to this computational achievement, the conventional way to glean such information would be painstaking laboratory experiments on each sequence and each possible mutation in that sequence. This number of possible functions and mutations is too big to contemplate — an experimental approach would require testing each mutation against more than 2,000 types of protein interactions and repeating those experiments over and over across tissues and cell types, amounting to hundreds of millions of experiments. Other research groups have sought to accelerate this discovery by applying machine learning to targeted sections of DNA, but had not achieved the ability to look at each DNA unit and each possible mutation and the effects on each of more than 2,000 regulatory interactions across the whole genome.
“What our paper really allows you to do is take all those possibilities and rank them,” said Park. “That prioritization itself is very useful, because now you can also go ahead and do the experiments in just the highest priority cases.”
Lastly, the system calibrates its based on known disease-causing mutations and develops a “disease impact score,” an assessment of how likely a given mutation is to have an effect on disease.
In the case of autism, the researchers analyzed the genomes of 1,790 families with “simplex” autism spectrum disorder, meaning the condition is apparent in one child but not in other members of the family. (These data were taken from the Simons Simplex Collection of more than 2,000 autism families.) Among this sample, fewer than 30% of the people affected by autism spectrum disorder had a previously identified genetic cause. The newly found mutations are likely to significantly increase that fraction, the researchers said.
The ability to predict the functional effect of each mutation was the key innovation in this new study. Previous studies had found it challenging to detect any difference in the number of regulatory mutations in people with autism compared to unaffected people. The new method, however, looked at mutations predicted to have a high functional impact, and found a significantly higher number of such mutations in affected people.
When the researchers then looked at what genes were affected by these mutations, they turned out to be genes strongly associated with brain functions. These newly discovered mutations affected similar genes and functions as do previously identified mutations.
“Now we open the field to understand all the factors that may be involved in autism,” said Theesfeld.
This information also is important to families and their doctors to better diagnose the disorder and to avoid making overly general assumptions how one person’s autism might be classified with others. “They say that when you meet one person with autism you have met one person with autism because no cases are alike,” said Theesfeld. “Genetically, it seems to be the same way.”
With this new method, the team is analyzing the genetic causes of various forms of cancer, heart disease and other disorders.
The work was funded by the NIH and Simons Foundation. Other co-authors of the study include Aaron Wong, Julien Funk and Kevin Yao of Flatiron, and Yuan Yuan, Claudia Scheckel, John Fak and Yoko Tajima of Rockefeller. . .
An integrative analysis of non-coding regulatory DNA variations associated with autism spectrum disorder - nature.com
ABSTRACT
A number of genetic studies have identified rare protein-coding DNA variations associated with autism spectrum disorder (ASD), a neurodevelopmental disorder with significant genetic etiology and heterogeneity. In contrast, the contributions of functional, regulatory genetic variations that occur in the extensive non-protein-coding regions of the genome remain poorly understood. Here we developed a genome-wide analysis to identify the rare single nucleotide variants (SNVs) that occur in non-coding regions and determined the regulatory function and evolutionary conservation of these variants. Using publicly available datasets and computational predictions, we identified SNVs within putative regulatory regions in promoters, transcription factor binding sites, and microRNA genes and their target sites. Overall, we found that the regulatory variants in ASD cases were enriched in ASD-risk genes and genes involved in fetal neurodevelopment. As with previously reported coding mutations, we found an enrichment of the regulatory variants associated with dysregulation of neurodevelopmental and synaptic signaling pathways. Among these were several rare inherited SNVs found in the mature sequence of microRNAs predicted to affect the regulation of ASD-risk genes. We show a paternally inherited miR-873-5p variant with altered binding affinity for several risk-genes including NRXN2 and CNTNAP2 putatively overlay maternally inherited loss-of-function coding variations in NRXN1 and CNTNAP2 to likely increase the genetic liability in an idiopathic ASD case. Our analysis pipeline provides a new resource for identifying loss-of-function regulatory DNA variations that may contribute to the genetic etiology of complex disorders.
Introduction
Autism spectrum disorder (ASD) is a neurodevelopmental disorder with heterogeneous genetic origins. Recent genome sequencing studies have identified many risk genes from the loss-of-function protein-coding variants, which has driven a move toward analysis of convergent risk pathways [ 1, 2, 3, 4]. However, less is known about the contribution to risk by the variants in non-coding regulatory regions, which hold the potential to disrupt the finely tuned biological pathways involved in brain development, as demonstrated by a recent study in the influence of the 3' untranslated regions (3'UTR)-regulatory variants in language impairment [ 5]. Nevertheless, progress is being made by using whole-genome sequencing of the ASD families. In a recent landmark-autism study, non-coding de novo DNA variations were found to be enriched in the untranslated regions of genes, gene boundaries, and DNase I hypersensitive regions [ 6]. Similarly, a number of previous exome-sequencing studies including our own [ 7] have shown that there are many DNA variations that occur in the non-coding regulatory regions of genes, albeit of unknown functional significance.
The functional impact of the non-coding variants has been difficult to interpret due to the lack of specific knowledge concerning their contribution to the regulatory regions that control and modulate gene transcription. Even in cases where a variant in a non-coding region is found to be associated with a disorder through linkage analyses, it is a challenge to assess whether such variations have gene regulatory functions, or whether the risk lies elsewhere in the linkage region [ 8]. The problem is compounded with the de novo mutations, which are unlikely to be represented in public databases, despite their significant contribution to ASD [ 6, 7, 9, 10, 11]. Much of this difficulty arises from the lack of data integration and functional sequence analysis surrounding transcription control, the molecular interactions, and transactions, of which are poorly understood when compared to the classic set of rules (also known as genetic code) by which genetic information encodes the protein sequences. In this regard, the identification and contribution of the non-coding regulatory variants to ASD-associated biological pathways remains unknown.
The availability of large, genome-wide regulatory resources such as FANTOM5 [ 12], ENCODE [ 13], and the BrainSpan developmental brain tissue expression datasets [ 14] have been instrumental in developing modern-systems-biology-approaches to evaluate the regulatory loss-of-function mutations associated with human disorders. Moreover, these novel approaches involve predicting whether a variant has the potential to change the dynamics of gene networks by disrupting the interactions between key regulatory molecules, such as miRNAs and transcription factors, and their target functional pathways [ 8, 15, 16]. There is now an increased focus on ASD-associated mutations within the regulatory regions, particularly near-known ASD-risk genes [ 17, 18], with studies examining the enhancer regions [ 19] and microRNA targeting [ 5]. Together these data provide a vision of how we might assess the potential impact of non-coding regulatory DNA variations associated with ASD.
In this study, we developed a systems-based analysis to identify the non-coding regulatory variants using the previously published whole-exome sequencing (WES) data of the ASD families [ 7], and focusing on the microRNA genes and the gene-proximal cis-regulatory variants. Single nucleotide variants (SNVs) in the regulatory regions were filtered by rarity and functional score. We compiled a genome-wide resource of regulatory regions covering microRNA genes and their putative target sites, transcription start sites, and transcription-factor-binding sites (TFBS) in promoters, 5' and 3' UTRs, to evaluate the relationship of these regulatory SNVs to the synaptic pathways. We further characterized these regulatory mutations using ASD-related databases and functional network analyses to show that rare heterozygous loss-of-function coding and non-coding variations combine as the probable causal elements. We specifically highlight the functional significance of a SNV found in microRNA-873, which affects target binding affinity and regulation of the key ASD-risk genes.
Materials and methods
Annotation and filtering of whole-exome sequencing dataThe WES data used in this study was from our previous study of Australian ASD families (n?=?128; 48 ASD cases, and 80 parents) [ 7], and we considered the single nucleotide variants (SNVs) for this analysis as there were no large copy number variations (CNVs) detected in this sample (data not shown). The participants were recruited from the Western Australian Autism Biological Registry at the Telethon Kids Institute in Perth, Australia [ 20] and the blood samples were collected and analyzed with the approval by the Ethics Committee’s at the Princess Margaret Hospital for Children, University of Western Australia (1845/EP) and the University of Queensland (2012000269). Informed written consent was obtained from next of kin, caretakers, or guardians on behalf of the minors/children enrolled in our study. In this study, we have mapped our WES data (NimbleGen SeqCap version 3) to the hg19-reference genome with the BWA-MEM algorithm [ 21]. Duplicate reads were removed using Picard tools v2.60. Indel realignment and base quality score recalibration were performed using GATK v3.5 [ 22]. The germline SNVs and indels were called using the GATK HaplotypeCaller tool and only high quality SNVs were selected for further analysis (RMSMappingQuality, MQ?>?40; QualByDepth, QD?>?2). We further selected only SNVs at genomic positions covered by more than 20 reads in at least 90% of the 128 samples. Annotations were made using the ANNOVAR package (2017Jul16) [ 23] with additional databases (dbSNP150, gnomAD exome and genome collections, CADD version 1.3). The Genome Aggregation Database (gnomAD) was used to obtain minor allele frequencies (MAF) for all quality-passed SNVs (NFE or Non-Finnish European children were used as background population) [ 24].
As per the previous analysis of the coding variants in the same ASD cohort [ 7], we used a two-stage filtering process to identify the most interesting SNVs. First a Minor Allele Frequency filter (MAF?<?0.01) was applied to select the rare variants, which were thought to have a larger effect in complex disorders such as ASD [ 7, 25, 26], then a ‘functional’ score (CADD score?=?15) [ 27] was used to prioritize the potential loss-of-function SNVs.
Computational analysis of the regulatory regionsWe defined the regulatory SNV set as those situated in promoters, transcriptional start sites, enhancers, microRNA genes, putative microRNA regulatory elements (MREs) on the 3'UTRs, and TFBS on the promoters, 5'UTRs and 3'UTRs of the protein-coding genes.
Promoter regions were defined using FANTOM5 phase 1 TSS (transcription start site) CAGE data (Cap Analysis of Gene Expression) as regions 1000?bp upstream and 300?bp downstream of the annotated CAGE peaks for Human Genome Nomenclature Consortium (HGNC) gene records [ 12]. The enhancer region coordinates were downloaded from the FANTOM5 phase 1 database [ 28]. Genomic annotations of the non-coding regions are from the University of California Santa Cruz (UCSC) ‘knownGene’ annotation (downloaded 21 February 2013) on the hg19-human reference genome including all transcript variants.
For TFBS prediction, we built a database of 1289 DNA motifs collected from different studies: the FANTOM5 project [ 12], JASPAR [ 29], Cristino, Williams [ 30], and Jolma and Yan [ 31], as well as data from several stem-cell-focused transcription factor studies [ 32, 33, 34, 35, 36, 37, 38, 39, 40]. We searched for instances of these DNA motifs (at least 70% similarity) on 3'UTRs, 5'UTRs, and promoter regions of genes, using the TAMO package (Version 21 March 2012) [ 41].
We predicted MREs across 3'UTRs for all human microRNAs (miRBase version 19) [ 42]. The microRNA-target sites predicted by miRanda [ 43], (default parameters, free energy =?-18), having a 75% overlap with RNAhybrid [ 44] (-b 2000, -e -18, -s 3utr_human, then filtered free energy =?-25) were used for further analysis.
We also searched for SNVs falling within the microRNA-mature-sequence regions themselves. These variants were filtered by population frequency (MAF?<?0.01) and CADD score (CADD?=?15), and SNVs interpreted on the basis of the mature sequence (as annotated in miRbase19) and seed regions (2–8nt at 5'-end of mature sequence). To identify the conserved microRNA match motifs as a proxy for seed regions, we ran MEME (v4.9.0, options: -mod zoops -nmotifs 1-maxw 15-maxsize 10,000,000) over each microRNA’s predicted MREs [ 45]. Regions of motifs with low information density were trimmed out with TAMO (trimming threshold 0.2) [ 41]. To estimate the effects of the microRNA gene variants in terms of loss or gain of targets, we used miRanda [ 43] (default parameters, free energy <?=?-18) to predict the target sites for microRNA sequences incorporating their respective variants across all known 3'UTR sequences available in the TargetScan database v7.0 [ 46].
Enrichment analyses of the regulatory SNVsTo test for a general association of the regulatory SNVs with ASD, enrichment was calculated against several ASD-associated gene sets from different sources; the Simons Foundation Autism Research Initiative (SFARI) AutDB database (December 2015 update, Scores indicating any support: S,1–4) [ 47], our previously developed ASD protein-protein interaction network (AXAS-ASD), which uses ‘seed’ genes from public resources and their first-degree neighbors [ 30] and Module 13 (M13-Li2014)—a module of protein–protein interactions from Li, Shi [ 48]—which is associated with ASD risk. In addition three control datasets were retrieved from the GWAS catalog (https://www.ebi.ac.uk/gwas): (1) coronary artery disease (GWAS-CAD), (2) Crohn’s disease (GWAS-Crohn), (3) and autism spectrum disorders (GWAS-ASD).
We also calculated enrichment of the regulatory SNVs (via hypergeometric distribution one-tailed test in R) in co-expression modules constructed and characterized by Parikshak et al. [ 2].
Biotinylated microRNA-mRNA pulldown transcriptome analysisBiotinylated microRNA-mRNA pulldown experiments were used to identify bound mRNA transcripts of the miR-873-5p wild-type (WT) and the variant miR-873-5p (Mut). Briefly, biotinylated microRNAs were transfected into the undifferentiated SH-SY5Y cells (a human neuroblastoma cell line), and captured along with their targeted transcripts to magnetic streptavidin beads (Invitrogen Dynabeads M-280).
The pulldown protocol was primarily drawn from Wani and Cloonan [ 49], as well as Cristino and Barchuk [ 50], with the following modifications. The samples were transfected in a 75?cm [ 2] flask with 560 pmol of biotinylated microRNA with lipofectamine 2000 (Thermo Fisher), and grown for 24?h before pulldown. Four biological replicates, from four independent transfection and pulldown experiments, were made for each microRNA (WT and Mut). RNA was purified with RNAeasy columns (QiAGEN), and sequencing libraries were prepared from 100-ng input RNA with Illumina TruSeq Stranded mRNA kit. Libraries were sequenced across two lanes using v4 SBS chemistry on an Illumina HiSeq2000. Cells were grown in DMEM F12 with HEPES media with penicillin/streptomycin at 37?°C with 5% CO2. The synthetic microRNA duplexes with biotin tags on the 3' end of the mature microRNA were ordered from IDT (Supplementary Table 1).
The efficiency of transfection was checked quantitatively through staining. Then cells were fixed in 4% PFA, rinsed with PBS, and blocked with 2% BSA in 0.3% Triton X-100 (TX) for 30?min at room temperature for 2?h. The cells were incubated with Alexa Flour 594 streptavidin (Illumina) in BSA/PBS/TX mix for 1?h at room temperature on rotation, followed by 10?min of DAPI incubation.
Sequence data quality was examined with FastQC (v0.11.3). Illumina-adapter sequences were removed and reads quality trimmed with Trimmomatic [ 51]. Sequences were aligned to the human-GRCh38-Ensembl-release-83 transcriptome with TopHat2 [ 52]. The gene-level counts were generated from paired reads in HTseq-count (Parameters: -s reverse -m union). The differential expression analysis to evaluate the enrichment of targets in pulldown samples (paired comparisons), compared to matched whole-transcriptome of transfected cells (controls), were done using DEseq2 [ 53]. Gene ontology hypergeometric enrichment was calculated with GOStats (v2.40.0, parameters, conditional?=?FALSE) [ 54]. The sequence data is available via GEO accession: GSE98088.
Bioinformatics, data analysis tools, and structural modelingManipulation of sequence, annotation, and the SNV files was undertaken using VCFtools (v0.1.12a) [ 55], tabix (v0.2.6) [ 56], ANNOVAR (version: 2017Jul16) [ 23], SAMtools (v0.1.18) [ 57], and bedtools (v2.17.0) [ 58]. Cytoscape (v3) [ 59] was used for network visualization, and the ClueGO plugin (v2.3.2) [ 60] used for functional annotation enrichment. The crystal structure of the extracellular domain of NRXN 1 alpha was from Bos taurus (PDB: 3QCW [ 61]), and the resolution of the structure was used as the model was 2.65?Å. The Pro436Ser mutation was modeled using FoldX version 3.0b4 (http://foldxsuite.crg.eu/), with default settings.
Results
Annotation, and filtering the genetic variants in the regulatory regionsIn this study, we have used the previously published WES data [ 7] to investigate putative regulatory genetic variation in 128 individual genomes from an Australian ASD cohort (48 affected cases and 80 parents). To examine the captured non-coding regions, we developed a computational pipeline to systematically analyze WES data in non-coding regulatory contexts: proximal promoter regions, transcription start site (TSS), untranslated regions (5' and 3'UTRs), TFBS, MRE, and miRNA genes (Fig. 1).
Fig. 1
 An overview of the computational pipeline and databases used to identify the putative regulatory regions. Identification of the regulatory regions to check for variants in (1) promoters and transcription start site (TSS), (2) TFBS, (3) MREs, and (4) miRNA genes
Full size image
We generated overlapping predictions of TFBSs (4.8 million in 5'UTR/22.4 million in 3'UTR from 1289 DNA motifs) and MREs (2.6 million from 2042 mature miRNAs) across 59,133 untranslated regions of 19,414 genes, and used public databases of miRNA genes (miRbase v19) [ 42], transcription start sites and promoters (FANTOM5) [ 12] and enhancers (FANTOM5) [ 12] (Table 1).
Table 1 Summary of regulatory region predictions, exome-sequencing coverage, and SNVs in the regulatory regions Full size table
We assessed the WES coverage across the different non-coding regions (Supplementary Figure 1). There was sufficient read depth for the call-variants across 20–24% of all predicted TFBS and MREs in the untranslated regions (Table 1). Fortuitously, despite the limitations of exome capture design in the non-coding regions, the actual sequence coverage is highest in the regions proximal to exome-capture gene targets, where the most potentially interesting cis-regulatory elements would be expected (Supplementary Figure 1). The MiRNA genes were well-covered (48%), and although the promoters had less callable coverage (13%), they still yielded valuable data (Table 1). The enhancers had very low callable coverage (1%) due to their distance from the captured exonic regions, and were omitted from this analysis (Table 1).
After filtering the SNVs for rare population frequency (Minor Allele Frequency =1% based on gnomAD exome and genome collections) and functional annotation (CADD score =15), each individual case had an average of 288 protein-coding and 299 regulatory variations (Supplementary Figure 2; Supplementary Table 2). Greater numbers of the SNVs were identified in protein-coding regions and TFBSs in the promoter regions, followed by UTR TFBSs and 3'UTR MREs, reflecting the relative sizes of these regions (Table 1).
The genome-wide annotation of regulatory regions, and TFBS PWM (Position Weight Matrix) definitions in TAMO format [ 41] are available at figshare (https://doi.org/10.6084/m9.figshare.2198773.v1).
Association of the rare variants with the ASD-risk genesWe used a standardized score (Z-score) of the binomial distribution test to assess whether the genes with rare protein-coding and non-coding regulatory SNVs found in our ASD cohort were over-represented in the ASD-associated genes. In a previous study [ 7], we have shown that the rare protein-coding SNVs identified in our ASD cohort were significantly enriched in ASD-associated pathways (AXAS-ASD) [ 30]. Herein, we have modified our pipeline to identify the putative loss-of-function SNVs in both protein-coding and non-coding regions, and also to reduce filtering steps biases (MAF?<?0.01 and CADD score =15; Supplementary Table 2). Strikingly, the regulatory SNVs identified in ASD cases were significantly enriched in the ASD-associated datasets (AXAS-ASD, SFARI, M13-Li2014, and GWAS-ASD; Fig. 2 and Supplementary Table 3), but not in the control datasets—coronary artery disease (GWAS-CAD) and Crohn’s disease (GWAS-Crohn; Fig. 2 and Supplementary Table 3). Interestingly, the SNVs identified in the protein-coding regions showed the highest enrichment in SFARI gene list, while the SNVs in the non-coding regulatory regions are highly enriched in the AXAS-ASD-gene network (Fig. 2; Supplementary Table 3).
Fig. 2
 Enrichment of the filtered regulatory variants and the protein-coding variants from ASD cases in different ASD-associated datasets AXAS-ASD (n?=?2664) [ 30], SFARI (n?=?397) [ 47], M13 - Li 2014 (n?=?115) [ 48], and GWAS-ASD (ASD associated genes as described in the NHGRI-EBI GWAS Catalog). Two unrelated disease control datasets, coronary artery disease (GWAS-CAD), and Crohn’s disease (GWAS-Crohn) were also retrieved from the NHGRI-EBI GWAS Catalog. The bar graph shows –log10(P-values) for the Z-scores (Supplementary Table 3). Only genes found in the Human Genome Nomenclature Consortium (HGNC) database and the whole protein-protein interaction network [ 30] were considered as the background datasets used for Z-score calculations (Supplementary Table 3)
Full size image
Having seen an overall ASD association, we then examined whether the regulatory mutations were over-represented in particular pathways or regions during neurodevelopment. Parikshak et al. [ 2] have previously constructed and characterized a set of co-expressed gene modules with respect to ASD. We saw significant enrichment of genes having regulatory mutations in ASD cases against two modules associated with regulation of gene expression (M3) and early synaptic development (M16; Supplementary Table 4), at late fetal and early-postnatal timescales [ 2].
Effect of the rare microRNA variants in ASD-associated gene networksWe identified a total of four microRNAs with rare SNVs within their mature sequences, which could globally disrupt their mRNA binding specificity (Supplementary Table 5). Of these, one variant was in the conserved “seed” regions of the microRNAs, where there is greater potential for disruption due to the exact target complementarity required in canonical microRNA-mRNA binding (Supplementary Table 5). We hypothesize that miRNA mutations within the mature sequences will affect the microRNA-mRNA-binding affinity and change the regulation of several target genes. We found miR-873 gene had a SNV within the seed region of the mature miR-873-5p (Fig. 3a) sequence. Computational predictions of miRNA binding suggest that the mutant miR-873 loses 76% of the predicted target genes, while also gaining 74% potential new target genes. This is the greatest predicted loss- or gain-of-binding of any of the miRNA variants (Supplementary Table 5). In contrast, the impact of variants that lie in less-conserved positions outside the seed regions (e.g., miR-411-5p, miR-668-3p and miR-323b-5p; Supplementary Table 5) would be less disruptive.
Fig. 3
 Biological significance of the miR-873 variant. a miR-873-5p showing the variant in individual f34-s1, relative to the conserved seed motif generated from predicted miR-873-5p regulatory elements with MEME [ 45]. b Venn diagram showing the overlap between genes significantly enriched in the miR-875-5p wild-type (WT) and mutant (Mut)-pulldown assays. c Percentage of the SFARI genes in genes specifically enriched in the miR-875-5p wild-type (WT) (n?=?905) or mutant (n?=?925) pulldowns – genes enriched in both pulldowns are not included. P-value calculated with chi-squared test between the groups, p-value?=?0.044, chi-squared statistic?=?4.06, and degrees of freedom?=?1. Error bars represent 95% confidence interval. Only genes present in both pulldown assays at high enough levels to have been contrasted (p-value calculated) by DEseq2 in both WT and Mutant experiments were included, omitting three genes that were significantly enriched in Mut but uncertain in WT. d Rare protein-coding and regulatory variants found in case f34-s1 overlap the AXAS-ASD network [ 30] and the miR-873-5p target genes (shown as diamonds, see Supplementary Table 6 for more details). The overlapped network shown is enriched in functional pathways associated with synaptic vesicle cycle and synaptic transmission (Supplementary Table 7). e Family f34 case study showing direct interaction of several synaptic proteins in the neurexin-neuroligin axis, in the context of the maternally inherited putative loss-of-function NRXN1 coding and the paternally inherited miR-873-5p variants in individual f34-s1. Putative target genes of miR-873-5p are shown in blue. (*) SFARI genes
Full size image
We propose that this rare miR-873-5p seed mutation could be relevant to ASD-associated genes and pathways, due to miR-873’s expression and genomic context. The miR-873 gene is embedded in the intronic region of the LINGO2 gene (second intron of LINGO2 variant 2). Examples of LINGO2 copy number variation (CNV) have been seen previously in ASD cases [ 62, 63, 64, 65], including two instances where CNV knocks out the 5' region of LINGO2 variant 2 including the miR-873 gene [ 66, 67]. Furthermore, LINGO2 has been shown to exhibit a neural-specific expression pattern in early development in mice [ 68]. Although the host gene expression is not an absolute indicator of embedded microRNA context, it may be concordant, and miR-873-5p mature sequence itself has neuronal expression in the human brain (Supplementary Figure 3) [ 69].
Biotinylated microRNA-mRNA pulldown experiments were performed to identify the transcriptome-wide binding profiles of wild-type and mutant miR-873-5p (Fig. 3b) with 3226 and 3249 significantly enriched targeted genes identified respectively (Supplementary Table 6). By contrasting the relative pulldown assay enrichment of the wild-type and mutant miR-873-5p, we identified 2321 overlapping genes putatively targeted by both wild-type and mutant miR-873-5p, while 905 and 928 genes were specifically enriched in the wild-type and mutant microRNA-mRNA pulldown, respectively (Fig. 3b).
We assessed functional associations of the miR-873-5p target genes by calculating gene ontology enrichment of their significantly enriched genes (Supplementary Table 6). Notably, the target genes of both wild-type and mutant miR-873-5p were both significantly enriched in the synapse genes (p-value?=?2.88E-09). Moreover, 109 SFARI candidate genes (27%, SFARI categories 1-4,S) overlapped with miR-873-5p targets, identified in our miRNA-mRNA pulldown experiment (in either wild-type or mutant). Interestingly, the mutant-specific miR-873-5p targets are more enriched (compared to ‘wild-type’ miR-873-5p) with many of the ASD-risk genes documented in the SFARI database (chi-squared test, p-value?=?0.04, df?=?1) (Fig. 3c).
Members of the neurexin gene family have been implicated in cognitive disorders including schizophrenia and ASD [ 70]. Two neurexin-family members, NRXN2 and CNTNAP2 (also known as NRXN4), have been found significantly enriched in our pulldown assay. NRXN2 has been found significantly enriched in both wild-type (log2 fold-change?=?1.8, p-value?=?4.03E-50) and mutant miR-873-5p (log2 fold-change?=?1.1, p-value?=?1.61E-20). However, the mutation may have caused a partial loss-of-binding to NRXN2, since its enrichment levels are smaller in mutant miR-873-5p-pulldown data (1.5 fold-change decrease in enrichment). CNTNAP2 was found significantly enriched in mutant (log2 fold-change?=?0.75, p-value?=?2.5E-04) and wild-type miR-873-5p (log2 fold-change?=?0.58, p-value?=?0.03), but suggests partial gain-of-binding to CNTNAP2 for the miR-873-5p variant. Intriguingly, miR873’s host gene, LINGO2, is enriched in the wild-type miR-873-5p-pulldown assay (log2 fold-change?=?0.72, p-value?=?6.76E-05) but not in the mutant microRNA (log2 fold-change?=?0.14, p-value?=?0.49), indicating loss-of-binding to LINGO2 mRNA. Another relevant ASD-risk gene, the transcription factor FOXP1 [ 71], has also been found to be significantly enriched only in the mutant miR-873-5p pulldown assay (log2 fold-change?=?0.53, p-value?=?3.7E-05) indicating a complete gain-of-binding, specifically associated with the variant miR-873-5p (Supplementary Table 6).
Convergence of the rare regulatory and the protein-coding variants in synaptic pathwaysWe chose to follow-up on the effect of the SNV within miR-873-5p found in individual f34-s1 due to its location in the seed region (Fig. 3a), the microRNA’s robustly expressed profiles across all regions of the human brain (Supplementary Figure 3), and functional enrichment analysis of its predicted target genes pointing toward a role in the regulation of nervous system development (Fisher’s exact test with Bonferroni step down corrected P-value?=?0.002) and signaling (Fisher’s exact test with Bonferroni step down corrected P?=?0.0003) [ 60].
However, the miR-873-5p SNV is heterozygous and inherited from an unaffected father, so although not causative on its own, we hypothesized that it could contribute additively through interactions with other small-effect mutations inherited from parents, as well as de novo mutations. In our published whole-exome screen, the family members were assessed for a ‘broader autism phenotype’ (BAP), reflecting a subclinical expression of ASD phenotypes [ 7, 72]. Among parents of this proband, the mother was recorded as having a BAP [ 7]. In this study, we focused on the maternally inherited NRXN1 variant, as it is one of the most highly connected proteins in the case f34-s1-gene network (Fig. 3d) and a key ASD-risk gene involved in synaptic transmission [ 70]. Notably, in case f34-s1, few variant genes overlapped in both the AXAS-ASD and SFARI datasets and are well known players in the synaptic pathways (i.e., NRXN1, CADPS2, CNTNAP2) and regulation of gene expression (i.e., CTNNB1, FOXP1) previously associated with neurological disorders including autism [ 71, 73]. Three of these gene variants (NRXN1, CNTNAP2, and FOXP1) are inherited from the BAP mother, while two are de novo mutations (CADPS2 and CTNNB1). Nevertheless, there will be additional contribution of several other small effects inherited and de novo variations in both protein-coding and regulatory regions of genes involved in the functional pathways associated with synaptic vesicle cycle and transmission (Fisher’s exact test with Bonferroni step down corrected P?=?0.001 [ 60]; Supplementary Table 7).
The rare variation in NRXN1 genes (G?=?A mutation at chr2:50847195; rs78540316) results in a Pro429Ser structural change in LNS domain 2 of NRXN1a (UniProtKB - Q9ULB1 [ 74]). The extracellular domain structure of NRXN1a and other alpha neurexins is characterized by the linear assembly of six LNS domains, with the L6 LNS domain extended to form an “L” shape; this assembly has been shown to be important for their function as synaptic organizers [ 61]. Continuous electron density of bovine NRXN1a (PDB: 3QCW) is visible for L2-L6 LNS domains showing a concatenated arrangement of these domains mediated by ß11-ß12 loops, which play an important role in the interaction between LNS domains [ 61] (Fig. 4a). The packing of this loop in the interface of L2-L3 LNS domain involves insertion of Pro436 into a hydrophobic pocket of NRXN1a (Fig. 4b). Pro436 is highly conserved (Supplementary Figure 4), supporting its important role for the structural arrangement of NRXN1a. Analysis of the putative effects of replacing this conserved proline residue with serine on the stability of NRXN1a was undertaken with the FoldX force field, which has been shown to accurately predict the effects of mutations on the free energy of unfolding of proteins [ 75]. This revealed that the mutation is moderately (1.81?kcal/mol) to severely (3.18?kcal/mol) destabilizing, depending on whether Ser436 can hydrogen bond to the disordered lysine residue (Lys336), due to the loss of these stabilizing hydrophobic interactions (Fig. 4c). Given the importance of Pro436 to the correct packing of the L2-L3 LNS domains, the mutation of this conserved proline residue to serine (equivalent to Pro429Ser mutation in the human NRXN1) will most likely result in loss-of-stability of NRXN1a and disruption of the L2–L3–LNS domains interface. Thus, the predicted destabilizing effects of the Pro429Ser mutation provides a plausible structural change that will affect the structural integrity and thus the function of NRXN1a.
Fig. 4
 Structural modeling of the Pro429Ser NRXN1a mutation using the crystal structure of bovine NRXN1a (PDB: 3QCW). a The ß11-ß12 loops comprise of a large part of the L2-L3-LNS domain interfaces in NRXN1a. In the L2 LNS domain, the loop is characteristically bent in a horseshoe type conformation, with two proline residues Pro433 and Pro436 (magenta: equivalent to human Pro426 and Pro429) stabilizing the tightly kinked structure. b The packing of this loop in the interface involves insertion of Pro436 into a hydrophobic pocket of NRXN1a. c The Pro436Ser replacement (equivalent to Pro429Ser in humans) analyzed with the FoldX force field revealed this mutation is moderately (1.81?kcal/mol) to severely (3.18?kcal/mol) destabilizing, depending on whether Ser436 can hydrogen bond to a disordered Lys336
Full size image
Considering this structural evidence, we propose that putative loss of NRXN1a protein function may additively contribute to mutant miR-873-5p dysregulation of the NRXN2 expression (Fig. 3e) and increase the threshold of genetic liability regarding their function in pre-synaptic transmission and synapse development in case f34-s1. Moreover, there may be further impact at the synapse, as the miR-873-5p-pulldown assay indicated the capability to bind a number of other ASD-risk genes. MiR-873-5p binds to key genes for synapse formation and function including all three SHANK family members (SHANK1, SHANK2 and SHANK3), three Neuroligin family members (NLGN2, NLGN3 and NLGN4X), Discs Large MAGUK Scaffold Protein 4 (DLG4 also known as PSD95), and two DLG associated proteins (DLGAP3 and DLGAP4), which have been linked to ASD by multiple studies [ 47, 76, 77] (Fig. 3e, Supplementary Table 6).
Discussion
Our analyses provide a novel and comprehensive genome-wide approach for identification and characterization of putative regulatory variations in an exome-sequenced-ASD cohort. By combining our computational TFBS and MRE predictions with cis-regulatory elements resources, we showed there was an overall cohort-level enrichment of the regulatory variants within the ASD-associated genes, derived from several gold-standard ASD datasets (Fig. 2) [ 30, 47, 48]. We specifically focused on a rare mutation observed in the miR-873 gene, and characterized its potential impact in the regulation of the synaptic genes.
ASDs have complex and heterogeneous genetic causes, and may arise from combinations of inherited and de novo variation in the protein-coding [ 4] and non-coding regions [ 78, 79]. To investigate a possible additive contribution of the regulatory SNVs, it therefore makes sense to focus on the clusters of variants, both regulatory and protein-coding, that are in the functional neighborhood of known ASD-associated gene networks and functional pathways [ 30]. Herein, we identified the rare loss-of-function variants in the regulatory regions potentially disrupting the expression of the “peripheral” genes incurring in non-zero effects on regulation of the “core” genes with more direct and specific roles in disease etiology. Our findings support the omnigenic model of complex disease [ 80], which proposes that any regulatory variation in the genes expressed in disease-relevant cell-types can contribute to disease etiology with non-linear effects on risk for that disease/disorder.
There have been several studies that have identified the functional gene modules involved in ASD-associated biological processes [ 2, 3, 48]. Two of the genome-scale co-expression modules constructed by Parikshak et al. [ 2] were enriched in our regulatory variant set; notably one of the modules was the same one (M16) that they had highlighted for the potential involvement in ASD. It is enriched for ASD relevant GO terms such as “synaptic transmission” and “homophilic cellular adhesion” and more specifically expressed in later fetal development and early infancy [ 2]. The hypothesis of Parikshak et al. [ 2], that M16 might be targeted by lower-risk inherited variants, in contrast to other modules involved in more fundamental early developmental processes, is supported by enrichment of our dataset of regulatory inherited mutations.
Our study identified four variants found within the miRNA genes that were predicted to disrupt gene regulation. Notably, we identified a rare variation in the seed region (the major factor in binding specificity) of miR-873-5p and functionally verified the changes in binding affinity of the miR-873-5p variant. MiR-873 is of particular interest as it is located within a neural gene LINGO2, which has been investigated with respect to essential tremor and Parkinson’s disease [ 81]. There are also a number of reported CNVs of LINGO2 found in ASD cases [ 62, 63, 64, 65], including one which overlays miR-873 [ 66, 67]. Furthermore, miR-873 promoter expression has also been found to be significantly enriched in human brain cells and tissues (Fantom5 miRNA Atlas; P-value?=?2e-40) [ 82]. It is still unknown whether these particular rare SNVs on miR-873-5p (rs777143952; MAF?<?6.6e-05) will be statistically over-represented in ASD cases, and only very large whole-genome sequencing (WGS) studies would have the necessary power to assign any statistical association to a specific rare SNV. However, the relevance of the rare variants outside protein-coding regions, for human complex traits and phenotypic variance, has only started to be explored in developmental disorders by recent studies using WGS data from the large samples sets of European ancestry [ 78, 79].
The binding profile of miR-873-5p measured by the pulldown assay includes several ASD-risk genes—all three members of the SHANK family (SHANK1, SHANK2, and SHANK3), several neuroligins (NLGN2, NLGN3, and NLGN4X), and neurexins (NRXN2 and CNTNAP2). This targeting information, combined with the neural expression of miR-873-5p and its host gene LINGO2, strongly suggest a potential regulatory role in synapse development and function. We also found that the miR-873-5p variant binds to the transcription factor FOXP1 transcript, suggesting a dysregulation of FOXP1 expression, which has previously been shown to play a critical regulatory role in CNTNAP2 expression [ 83], as well as a functional role of striatal pathways important for vocal communication [ 84]. Despite this miRNA mutation being inherited from an unaffected parent, the measurable loss-of-binding in several ASD-risk genes leads us to postulate that it may contribute additively to ASD risk, particularly when combined with the mutation load from the mother with recorded BAP. Consistent with this idea, is the convergence of functional disruption of neurexins in this individual—specifically an observed partial change in NRXN2 and CNTNAP2 binding affinity from the paternally inherited miR-873-5p mutation when combined with the maternally inherited NRXN1 and CNTNAP2 coding variants. Moreover, this ASD case (f34-s1) carry two additional de novo protein-coding variations in CADPS2 gene that are involved in neurotrophin release and interaction with dopamine receptor type 2 [ 85], and CTNNB1 gene associated with impaired social interactions and repetitive behaviors [ 86]. Considering the above evidence concerning pathways and processes associated with ASD (e.g., synapse development), this work showcases the importance of compounded effects of protein-coding and regulatory mutations in a biological systems framework, particularly in the case of de novo mutations, which contribute significantly to ASD risk [ 10].
We are aware of the limitations of using a human neuroblastoma cell line (SH-SY5Y), in which the transcript levels and alternative splicing isoforms might not necessarily capture the same molecular interactions as CNS neurons, and we are mindful that spatial and temporal gene expressions patterns of various brain cells and tissues have been shown to also vary. Nevertheless, our study is the first to propose an integrated approach to predict and validate a mechanistic hypothesis for a possible causal effect of a single nucleotide variation changing the binding affinity between miRNA and target transcripts. Hence, our approach still represents a significant and novel contribution to systematically evaluate the (under the same SH-SY5Y transcriptome background) loss or gain of affinity effects of the genetic variants in regulatory regions, such as miRNA genes.
We are also mindful that our analyses are focused on rare loss-of-function heterozygous DNA variations that are represented by a minor allele frequency of <?1% in the human population. We are similarly aware of the ongoing debate as to the relative contribution of the rare inherited alleles to ASD and that there exists an uncertainty associated with current estimates [ 87]. Much of this uncertainty surrounds the contribution of loss-of-function variations found in non-coding DNA that comprises majority of the genome. Arguably, in the absence of detailed whole-genome analysis, estimates concerning the contribution of the rare inherited non-coding variants to ASD remain speculative. The whole-genome sequencing approaches when combined with rigorous clinical and behavioral profiling of family cohorts and functional characterization of DNA variations are now providing the means to further identify and evaluate rare and de novo non-coding regulatory variations that may affect brain development [ 4, 6, 88, 89], yet we see that even exome-based data can yield useful information. A major confound regarding estimates of heritability surrounds the impact of epigenetic processes in the penetrance of heterozygous alleles. Although it is difficult to the access neural tissue, there are attempts to interpret global epigenetic events to identify patterns of methylation and other DNA modifying chemistries, and assess the regulatory significance of non-coding DNA variations occurring at these sites. Interestingly, Yuen et al. (2016) recently reported that de novo regulatory variations associated with DNMT3A and ADNP are likely to more directly affect the global methylation processes. With regard to the epigenetic mechanisms, any new data resource detailing non-coding sequences that control gene regulation or the stability of transposons and repetitive DNA that affect transcription, can logically be incorporated into this analysis pipeline to more completely account for other regulatory layer of DNA variations associated with ASD.
Public genomic resources are constantly improving, larger populations of SNVs will aid the identification of the rare variants and projects such as psychENCODE [ 90] and CommonMind consortium [ 91], aimed at generating detailed datasets of regulation in the brain, can only improve interpretations of the variants in the regulatory regions. The approach described here is equally applicable to the whole-genome sequencing data and to the evaluation of the variants identified in the genome-wide association studies; moreover it provides a generalized template for analysis of complex genetic disorders. When combined, more ASD-cases-sequence data, new genomic resources, and integrated systems-based approaches that examine the regulatory variants in ASD-associated genes and pathways, as identified in this study, should elucidate the role of the non-coding regulatory variants and their additive contribution to the risk of ASDs. |