Thursday, December 8, 2022
HomeTechnologyNetREX-CF integrates incomplete transcription factor data with gene expression to reconstruct gene...

NetREX-CF integrates incomplete transcription factor data with gene expression to reconstruct gene regulatory networks

NetREX-CF—method overview

The NetREX-CF model is a data integration framework for reconstructing GRNs by organically utilizing both gene expression E and a set of prior networks P = {P1, . . . Pd}. The main idea behind the NetREX-CF model is an integration of two complementary optimization strategies: (i) a machine learning component designed based on CF that is able to infer hidden features from the current observed prior networks P and utilize these features to recommend an improved GRN and (ii) a sparse NCA-based network remodeling component that can refine the topology of a GRN based on given gene expression E. These two computational components operate alternatively. The CF component recommends new edges to the current GRN and the sparse NCA-based network remodeling component screens the recommended edges and keeps the edges that are essential to explain the expression of a given gene. Once the sparse NCA-based network remodeling component confirms some of the recommended edges, the CF component further utilizes those retained recommended edges to make new edge recommendations for the sparse NCA-based network remodeling component to further examine (illustrated in Fig. 1).

Fig. 1: Method overview.
figure 1

Collaborative filtering (CF) and NCA-based gene expression modeling alternatively inform each other during a joint optimization process: CF recommends edges to be confirmed by the NCA model and used to improve CF.

Computationally, the system illustrated in Fig. 1 is achieved by simultaneous optimization of the following sets of variables: (i) the activities of TFs (matrix A), (ii) a weighted GRN (matrix S), and (iii) two feature matrices: the hidden features for target genes (X where the ith row xi represents the hidden feature vector for gene i) and the hidden features for TFs (Y where the jth row yj represents the hidden feature vector for TF j). The matrix A is optimized by the sparse NCA-based network remodeling component while the matrices X and Y are optimized by the CF component. Notably, the matrix S is the connection between the aforementioned two components and should be optimized by considering both components.

Formally, \(E\in {{\mathbb{R}}}^{n\times l}\) is the matrix of expression data of n genes in l experiments and prior network \({P}^{k}\in {{\mathbb{R}}}^{n\times m},\forall k\) is a weighted adjacency matrix of the bipartite graph that records the prior knowledge of regulations between m TFs and n genes. Matrix \(A\in {{\mathbb{R}}}^{m\times l}\) is the TF activity for m TFs in l samples and \(S\in {{\mathbb{R}}}^{n\times m}\) is a weighted GRN. We further define penalty matrix C and observation matrix B based on the set of prior networks P. The matrix C is used in CF component. For edges in the prior, the corresponding elements in C will assign larger values to make sure those edges will be kept in the final prediction. For edges not in the prior, the corresponding elements in C will assign smaller values to encourage new edges as recommendations. The matrix B is used to indicate which edges have prior information. Each element in C can be computed by \({C}_{ij}=1+a{\sum }_{k}{P}_{ij}^{k}\) (a = 60 suggested by ref. 18). If more than one prior network suggests the regulation between the ith gene and the jth TF, then Cij tends to have a larger value. Large Cij would enforce giving the regulation between the ith gene and the jth TF a lower ranking. Each element in B is binary and can be computed by Bij = 1 if \({\sum }_{k}{P}_{ij}^{k}\,\ne\, 0\) and Bij = 0 otherwise. \(X\in {{\mathbb{R}}}^{n\times h}\) contains feature vector xi for gene i and \(Y\in {{\mathbb{R}}}^{m\times h}\) contains feature vector yj for TF j. Then, our optimization problem is formalized as following:

$$\begin{array}{rlr}\mathop{\min }\limits_{S,A,X,Y}&{{{{{{{\mathcal{H}}}}}}}}(S,A)+\lambda {{{{{{{\mathcal{F}}}}}}}}(S,X,Y)&\\ s.t.&{\left\Vert {x}_{i}\right\Vert }^{2}\le 1,\,\forall i\\ &{\left\Vert {y}_{j}\right\Vert }^{2}\le 1,\,\forall j.\end{array}$$


where \({{{{{{{\mathcal{H}}}}}}}}(S,A):= {\left\Vert E-SA\right\Vert }_{F}^{2}+{\lambda }_{A}{\parallel} A{\parallel }_{F}^{2}+{\lambda }_{S}{\parallel} S{\parallel }_{F}^{2}+{\sum }_{ij}{\eta }_{ij}{\parallel} {S}_{ij}{\parallel }_{0}\) is the sparse NCA-based network remodeling component; \({\lambda }_{A}{\parallel} A{\parallel }_{F}^{2}+{\lambda }_{S}{\parallel} S{\parallel }_{F}^{2}\) are standard regularization terms and ∑ijηij Sij0 induces sparsity of a given prior GRN and therefore only essential edges that help to minimize \({{{{{{{\mathcal{H}}}}}}}}(S,A)\) are retained. Sij0 is the 0 norm that is 1 if Sij ≠ 0 and 0 otherwise.

In (1) \({{{{{{{\mathcal{F}}}}}}}}(S,X,Y):= {\sum }_{i,j}{{{{{\varOmega}} }}}_{ij}{({{{{{\varTheta}} }}}_{ij}-{x}_{i}^{T}{y}_{j})}^{2}\) optimizes the hidden features X and Y of the CF component; Θij is a binary matrix of edges to be predicted by the hidden features in the given iteration and Ωij encodes penalties that guide the predictions. Both ΘijSij0Bij and \({{{{{\varOmega}} }}}_{ij}:= {\bar{C}}_{ij}{\parallel} {S}_{ij}{\parallel }_{0}+{C}_{ij}(1-{\parallel} {S}_{ij}{\parallel }_{0})\) are defined based on Sij0 and the penalty matrix Cij is built from the prior information . Detailed explanation of Θij and Ωij are provided in Method Details section. For the initialization step, both Θij and Ωij are defined based on the prior networks only while in the subsequent steps they also take into account the results of the sparse NCA-based network remodeling component (illustrated in Fig. 1). To solve the problem (1), we first put all continuous terms together and define \(H(S,A):= {\left\Vert E-SA\right\Vert }_{F}^{2}+{\lambda }_{A}{\parallel} A{\parallel }_{F}^{2}+{\lambda }_{S}{\parallel} S{\parallel }_{F}^{2}\) and then put all the non-continuous terms together and define \(F(S,X,Y):= {\sum }_{i,j}{{{{{\varOmega}} }}}_{ij}{({\parallel} {S}_{ij}{\parallel }_{0}\oplus {B}_{ij}-{x}_{i}^{T}{y}_{j})}^{2}+{\sum }_{ij}{\eta }_{ij}{\parallel} {S}_{ij}{\parallel }_{0}\). Then the optimization problem has a general format of an objective function as Φ(S, A, X, Y) = H(S, A) + F(S, X, Y), where H(S, A) is continuous but non-convex and F(S, X, Y) is a composite function of 0 norm of elements of S and other variables so it is neither continuous nor convex. More importantly, Sij0 is coupled with xi and yj, so that Sij0 cannot be separated from F(S, X, Y) as a distinct term. To the best of our knowledge, there has been no known method that can optimize such a complex and non-convex function involving the inseparable 0 norm. To fill this gap, we propose here an algorithm, GPALM that generalizes the so-called PALM method22 and solves a class of problems of the format above, under the assumption that F(S, X, Y) is lower semi-continuous (see Supplementary Note 1). The GPALM method is fully described in Supplementary Note 2 where we also formally prove its convergence. The source code of NetREX-CF is available at:

Validation and benchmarking NetREX-CF on yeast data

To demonstrate the capability of our GRN reconstruction method, we tested using datasets from multiple species, which include yeast, fruit fly, and human. For yeast, we collect multiple datasets that measure different perspectives of gene regulation. These datasets include TF ChIP7,33,34, TF DNA binding motif7,35, genetic knockdown7,36,37, and yeast gene expression7,38,39,40. TF ChIP, motif, and genetic knockdown datasets, serving as prior knowledge for TF-gene interactions in the yeast GRN. The details of these priors are summarized in Table 1 and the overlap among priors is illustrated in Table 1. We further utilize TF-gene interactions extracted from YEASTRACT database41 as a gold standard to validate the performance of GRN reconstruction. These gold standard TF-gene interactions are supported by both DNA binding and expression evidence. The details of the gold standard TF-gene interactions and their overlap with the prior datasets are shown in Table 1. Results generated by NetREX-CF are benchmarked against the results obtained from the published sequential methods, all of which are GRN prediction methods that are able to use prior knowledge. In the sections that follow, we go into detail about the comparisons between two popular approaches that only consider gene expression, GENIE342 and GRNBoost243, as well as prior-based approaches like NetREX-CF, MerlinP7, NetREX2, LassoStARS6, the original CF18, the summation of all prior knowledge (PriorSum), and a technique that assigns a random confidence score (uniformly distributed between 0 and 1; hereafter, a random method). For a detailed description of parameter selection for competing methods, we refer the reader to Supplementary Note 3.

Table 1 Overlap between prior networks and the gold standard network.

To ensure an impartial comparison, we use Average Rank scores (ARS)18. For each method and for each gene i, we can obtain a list of TFs that are predicted to regulate gene i and sort this TFs by the confidence of the prediction (most confident have higher rank). We use \({r}_{ij}^{g}\) to denote the percentile-ranking of TF j within the ordered list of all TFs for gene i. Thus, \({r}_{ij}^{g}=0 \%\) means that TF j is predicted with the highest confidence to regulate gene i, preceding all other TFs in the list. On the other hand, \({r}_{ij}^{g}=100 \%\) when TF j is predicted to be the least possible TF for gene i or there is no prediction between TF j and gene i yielded by the method. Based on the gold standard TF-gene interaction dataset I, we set Iij = 1 if TF j regulates gene i in the gold standard dataset and Iij = 0 otherwise. For any gene i, we use the average rank of the gold standard edges in the list of TF predicted to regulate gene i as the measure quality of the prediction:

$${\overline{rank}}_{i}^{g}=\frac{{\sum }_{j}{r}_{ij}^{g}{I}_{ij}}{{\sum }_{j}{I}_{ij}}$$


Lower values of \({\overline{rank}}_{i}^{g}\) are preferable, as they indicate gold standard TFs for gene i have a lower rank than others. Furthermore, the overall ranking considering all genes can be computed by

$${\overline{rank}}^{g}=\frac{{\sum }_{i}{\overline{rank}}_{i}^{g}}{\#{{{{{{{\rm{genes}}}}}}}}\,{{{{{{{\rm{in}}}}}}}}\,I}$$


The denominator is the number of genes that have gold standard TFs in dataset I.

Similarly, for each TF we can measure the quality of the sorted list of genes predicted to be regulated by it:

$${\overline{rank}}_{j}^{t}=\frac{{\sum }_{i}{r}_{ij}^{t}{I}_{ij}}{{\sum }_{i}{I}_{ij}},\quad {\overline{rank}}^{t}=\frac{{\sum }_{j}{\overline{rank}}_{j}^{t}}{\#{{{{{{{\rm{TFs}}}}}}}}\,{{{{{{{\rm{in}}}}}}}}\,I},$$


where \({r}_{ij}^{t}\) denotes the percentile-ranking of gene i within the ordered list of all genes for TF j and \({\overline{rank}}_{j}^{t}\) is the average rankings for the gold standard genes for TF j. \({\overline{rank}}^{t}\) is the overall average rankings considering all TFs.

Figure 2a illustrates the comparison between the competing algorithms in terms of average rankings of gold standard TFs for each target gene. The sorted average ranking curve for NetREX-CF is below all other methods, indicating that the average rankings of gold standard TFs predicted by NetREX-CF for each gene are much lower than the rankings predicted by other methods. In the average rankings of gold standard genes among the genes predicted to be regulated by each TF, NetREX-CF still achieves the best average ranking score. Remarkably, PriorSum (the weighted edge summation of three priors, ARS = 0.228) is very competitive with NetREX-CF (Fig. 2b, ARS = 0.224) indicating that NetREX-CF takes the best advantage of the prior data. It is important to note that NetREX-CF outperforms the original CF (ARS = 0.288), which demonstrates that the integration of the CF model and sparse NCA-based model is beneficial.

Fig. 2: Performance comparison for all competing algorithms on the yeast dataset.
figure 2

a The performance of the methods on the task of predicting regulating TFs: for each algorithm, we compute for each gene the rankings of gold standard edges \({\overline{rank}}_{i}^{g}\) adjacent to it, sort them in descending order, and plot the sorted average rankings. b The performance of the methods on the task of predicting regulated genes using a measure similar to in (a) but focusing on genes regulated by TFs. c The performance of the methods on the task of predicting regulating TFs that are not observed in prior data. The procedure is the same as in (a) but only the gold standard edges that are not included in the prior knowledge are used for the evaluation. d The performance of the methods on the task of predicting regulated genes not observed in prior data (similar to (c) but focusing on genes regulated by TFs). e Overlap between priors. f Venn diagram for gold standard dataset and the union of the three prior datasets. g Summary of the average rankings of each algorithm for tasks reported in panels (ad). The lowest scores are highlighted in bold red.

Next, in order to demonstrate the advantages of NetREX-CF in predicting ranks for missing data (edges that do not appear in the prior knowledge datasets), we identified all edges that are in the gold standard dataset but are not supported by any prior dataset. Indeed, a large portion of the gold standard dataset (4064 out of 10,525 gold standard TF-gene interactions) is not covered by any prior dataset (Fig. 2f). Therefore, we can use these gold standard interactions with missing prior data to compare the ability of the competing methods in recovering rankings under the assumption of missing data. NetREX-CF achieves much lower ARS for those missing data (Fig. 2c, d). In summary, NetREX-CF achieves the lowest, thus the best, overall ARS (Fig. 2g).

Reconstruction of the GRN for Drosophila Schneider 2 (S2) cell line and validation of NetREX-CF on new experimental data

While the datasets from yeast have been widely served as the only available benchmarking dataset for the evaluation of GRN inference methods, evaluating any GRN inference method using the only case would not produce convincing results. Therefore, for a more comprehensive assessment of network inference by NetREX-CF, we have performed gene expression profiling of Drosophila S2 cells, in which we selectively silenced each known TF. We chose S2 cells for these reasons. First, the Drosophila community extensively uses S2 cells, and there exists a huge amount of public resources. Such a resource includes TF binding profiling (i.e., ChIP-chip/Seq), performed by the Model Organism ENCyclopedia Of DNA Elements consortium (modENCODE44). Second, the cells are suitable for high-throughput RNAi studies45, whose outcome can be used to construct a benchmarking dataset when combined with genome-wide gene expression profiling.

As a prior, we collected publicly available datasets for S2 cells, which include 250 TF ChIP and 600 expression profiles (see Methods). TF DNA binding motifs were obtained from OnTheFly46, FlyFactorSurvey47, FLYREG v248, iDMMPMM49, and DMMPMM50. Similar to the yeast GRN, we used TF ChIP and TF DNA binding motifs as prior knowledge for TF-gene interactions in the S2 cell GRN. In addition, we also included the gene co-expression network (from 600 expression profiles) to increase the overall number of TFs with prior information (Table 2).

Table 2 Overlap between prior networks and the benchmarking network for Drosophila S2 cell line.

In order to validate the performance of our NetREX-CF and also to provide the finest dataset for GRN inference evaluation, we experimentally generated and sequenced a total of 1920 RNA-Seq libraries following RNAi knockdown of 488 genes (465 after filtering, see the Methods) that encode all expressed TFs in Drosophila S2 tissue-culture cells. Since this is a new dataset that is likely to be valuable outside the current study, we start with a robust analysis of the properties of the data. We used long double-strand RNA molecules (dsRNA) to selectively silence target TFs using the bathing method45. For the most of targets (452 out of 488), we used two different long dsRNAs and these were biologically duplicated to give four measurements per gene. Because we were interested in using these data to trace gene network perturbations without confounding effects from long-term RNAi treatment, we chose the 1-day incubation for the main experiment. Overall, we observed that the median RNA expression was 23.4% after the knockdown (see Methods).

These knockdowns disrupted coherent pathways, not only those controlling the cell cycle, but also cellular differentiation and developmental processes (Fig. 3). We observed subsets of genes with specific functional categories display correlated gene expression changes upon the depletion of different TFs (Fig. 3b–e). For example, the genes in Cluster 2 in Fig. 3a have enriched Gene Ontology (GO) terms in “growth” (GO:0040007, adj. p value = 1.89e–06, Hypergeometric test corrected with Benjamini and Hochberg method) and “regulation of protein modification process” (GO:0031399, adj. p value = 3.86e–08, Fig. 3b). On the other hand, the genes in Cluster 4 play a role in cell proliferation, which should be the most important phenotypic characteristic of dividing tissue-culture cells. Such genes showed enrichment of GO terms in “mitotic cell cycle process” (GO:1903047, adj. p value = 3.47e–12), “DNA replication” (GO:0006260, adj. p value = 4.21e–12) and “ribosome biogenesis” (GO:0042254, adj. p value = 8.76e–15) that are required for cell cycle progression (Fig. 3). Cluster 6 genes also function in cell proliferation with “ribonucleoprotein complex biogenesis” (GO:0022613, adj. p value = 3.25e–08) and “sister chromatid segregation” (GO:0000819, adj. p value = 1.29e–05), although this cluster includes metabolic genes as well (Fig. 3). Remarkably, RNAi-based differential expression of Cluster 4 and 6 genes demonstrated an anti-correlation to that of Cluster 10 genes (Pearson’s r = –0.86), which comprise developmental genes that function in “post-embryonic animal organ development” (GO:0048569, adj. p value = 2.30e–44), “metamorphosis” (GO:0007552, adj. p value = 4.65e–44), “neuron projection development” (GO:0031175, adj. p value = 2.57e–42), or “wing disc development” (GO:0035220, adj. p = 4.59e–42, Fig. 3).

Fig. 3: Gene expression changes upon targeted knockdown of Drosophila TFs in S2 cells.
figure 3

a A heatmap summarizes differential gene expression between Drosophila S2 cells treated with TF RNAi versus control RNAi (dsRNA for E. coli LacZ gene). Columns. Depletion of 465 TFs. Rows. Gene expression changes in the log2 scale. Genes that are not differentially expressed from any of the TF RNAi experiments were not shown in this heatmap. The results are clustered using the k-means clustering method. be Results from Gene Ontology (GO) analysis for the clustered genes in (a). The four clusters with the largest gene sets are demonstrated. Circle size. A number of genes are associated with a specific GO term or hits. Circle color. Adjusted p value. f A histogram that summarizes numbers of differentially expressed genes. Vertical red line. Median of the observation.

We further noticed that knockdowns of TFs with overlapped functions result in positively correlated gene expression changes. For example, we did GO enrichment analysis for TFs in the group “a” in Fig 3a. We found that TFs in group “a” are enriched in “instar larval or pupal morphogenesis” (GO:0048707, adj. p = 2.05e–04) and “metamorphosis” (GO:0007552, adj. p = 2.93e–04). Intriguingly, those TFs, when depleted, commonly reduced metabolic gene expression that function in the “amino sugar catabolic process” (GO:0046348), which is the function of genes in Cluster 8 (genes in Cluster 8 in Fig. 3a are enriched in GO:0046348 with adj. p = 0.00011) or “monocarboxylic acid catabolic process” (GO:0072329), which is the function of genes in Cluster 12 (genes in Cluster 12 in Fig. 3a are enriched in GO:0072329 with adj. p = 0.012). Likewise, we found a group of TFs (“b” in Fig. 3a) whose knockdown upregulates growth or protein metabolism-related genes (Cluster 2 in Fig. 3a), but downregulates cell cycle or carbon metabolism genes (Cluster 6 in Fig. 3a). Altogether, these observations indicate that coordinated responses exist in the Drosophila S2 gene network against the perturbations from individual TF knockdown. We then utilized this RNAi data to build an RNAi-based network and used this network as the benchmark dataset for performance evaluation of the competing methods. For the downstream analysis, we defined differentially expressed genes based on log 2-fold changes (either >1 or <−1) or adjusted p value (<0.05). Details of the RNAi network and its overlap with the other prior networks are illustrated in Table 2.

Using the S2 cell gold standard from the RNAi experiments, we evaluated the performance of our NetREX-CF with MerlinP7, NetREX2, LassoStARS6, the original CF18, PriorSum, GRNBoost243, GENIE342, and the method that makes a random choice. We describe the parameter selection for competing methods in Supplementary Note 3.

Figure 4a exhibits the comparison between the competing algorithms in terms of average rankings of TFs for each target gene in the benchmark dataset. The curve of NetREX-CF is below all other methods, indicating, on average, NetREX-CF ranks TFs identified from RNAi knockdown data lower than other competing methods. In addition, Fig. 4b illustrates the comparison between the competing algorithms in terms of average rankings of target genes for each TF in the benchmark dataset. Intriguingly, many of the tested methods failed to perform better than the random method. Similar observations have been made by previous studies3,4. Only NetREX-CF displayed better performance than the random methods, in terms of the overall ARS, far surpassing all the other methods (Fig. 4c). However, when the ranks of TFs were considered, NetREX-CF has better performance for 182 out of 458 TFs; its prediction power started to be weaker than the random method for the rest of the 276 TFs, analogous to other methods. Still, this result is far better than other methods that only predicted approximately the top 100 TFs (GENIE3 and GRNBoost2), or performed worse than the random methods for all the TFs. We believe that this is due to the TFs that regulate a large number of target genes (see Discussion).

Fig. 4: Performance comparison for all competing algorithms on the S2 cell line.
figure 4

a The performance of the methods on the task of predicting regulating TFs: for each algorithm, we compute for each gene the rankings of edges in the benchmark dataset adjacent to it, sort them in descending order, and plot the sorted average rankings. b The performance of the methods on the task of predicting regulated genes using a measure similar to in (a) but focusing on genes regulated by TFs. c Summary of the average rank scores in (a) and (b). The lowest scores are highlighted in bold red. d The bar plot indicates the number TFs whose target genes in the benchmark dataset are significantly enriched at the top of the ranked gene list (p value < 0.01 with 1000 random permutations), as predicted by each method in the GSEA analysis. e, f The Venn diagram between TFs is identified by the top two methods in (b): GENIE3 and NetREX-CF. The target genes of TFs shown in are enriched in the GO term of the mitotic cell cycle (e) and nervous system development (f), respectively. gi The same plots as (a) and (b). The only difference is that all the competing methods do not use the ChIP prior.

For further comparison between the benchmark data and the GRNs inferred by each method, we performed Gene Set Enrichment Analysis (GSEA). We tested whether the target genes for each TF in the benchmark dataset are statistically enriched on the top of the ranked gene list (p value < 0.01), generated by each method. We found that NetREX-CF achieves the best performance, compared to the other methods, by significantly recovering behaviors of 94 TFs in the benchmark network (Fig. 4d). It is worth mentioning that this number is larger than the summed number of TFs confirmed by the original NetREX (=38) and CF (=50), which represents the synergy between the two algorithms in NetREX-CF. GENIE3 followed NetREX-CF with 76 TFs recaptured.

To evaluate the biological relevance of the inferred GRNs, we compared NetREX-CF and GENIE3, which are the two best methods based on our validation. We performed GSEA analysis and investigated TFs whose inferred target genes are significantly enriched for two Gene Ontology (GO) terms: mitotic cell cycle (GO:0000278), or nervous system development (GO:0007399). Potential target genes under these functional terms are expected to be activated or repressed, respectively, in Drosophila S2 cells, which is a tissue-culture cell line derived from the late embryos51. NetREX-CF predicted a total of 172 TFs for the regulation of mitotic cell cycle genes when the number was 26 for GENIE3 (Fig. 4e). There was a total of 10 TFs inferred by both methods. The overlapped TFs include Myb, which is a fly ortholog of human MYB and MYBL1, as well as Sin3A. These TFs are known to regulate the G2 or G2/M transition of the cell cycle52,53. We also found that Grp, a product of grp (grapes) gene and the fly ortholog of human CHEK1, from the list. The TF plays a role in cell cycle checkpoint in response to DNA damage54. In addition, while the overlap was partial, both methods identified members of the E2F/RB transcription factors, or Myb-interacting proteins, which include E2f2 and Rbf2 (GENIE3), E2f1, Rbf, Mip120, and Mip130 (NetREX-CF). These TFs form a complex and regulate the G1/S transition of the cell cycle55,56. Therefore, both NetREX-CF and GENIE3 could efficiently predict TFs that regulate mitotic cell cycle genes.

Since NetREX-CF (172) could predict more TFs than GENIE3 (26) as the regulator of the same functional group (mitotic cell cycle), we asked whether NetREX-CF has better efficacy than GENIE3. Remarkably, only NetREX-CF identified a set of TFs that are related to Polycomb and Trithorax Group [Polycomb (Pc), Polycomblike (Pcl), Pleiohomeotic-like (Phol), Enhancer of zeste E(z), Additional sex combs (Asx), and Scm-related gene containing four MBT domains (Sfmbt), Trithorax-related (Trr)]. While these TFs are well known for their regulation of development and differentiation57, there also are involved in cell cycle control58,59,60,61,62. Moreover, only NetREX-CF, but not GENIE3, predicted members of the mediator complex (CycC, MED1, MED23, MED24, and MED26) as regulators of the mitotic cell cycle genes. Multiple studies have reported the role of these components in cell cycle control or proliferation62,63,64,65. We obtained a consistent result for the TFs that control the nervous system development (Fig. 4f). Both NetREX-CF and GENIE3 identified key regulators, such as Kay (Kayak, fly ortholog of human FOS) and Jra (Jun-related, fly ortholog of human JUN and JUND), or Foxo (Forkhead box, sub-group O), which are well studied for their roles in nervous system development in fruit flies and/or human66,67,68,69,70. However, only NetREX-CF could capture, for example, Sox14, which is one of the master regulators of neurogenesis in Drosophila71. Collectively, we observed that NetREX-CF outperforms GENIE3 in inferring biologically relevant GRNs for those functional categories.

Importantly, NetREX-CF surpassed the competing methods even when we did not use the TF binding prior (e.g., ChIP-Seq). ChIP experiments largely rely on the availability of suitable antibodies that can pull down TFs under the cross-linked condition. Therefore, we tested how the absence of ChIP data would change the performance of NetREX-CF using our Drosophila S2 cell benchmark dataset (Fig. 4g–i). The performance of GRNBoost2 and GENIE3 is not altered since they do not use any prior information (ARS = 0.497 for the TF predictions for both). In general, the performance of the prior-based methods (e.g., MerlinP, LassoStARS, the original NetREX, and PriorSum) became worse (compare Figs. 4a, g) when no TF binding prior was used (compare Figs. 4a, g). This was the case for NetREX as well (ARS 0.453 vs. 0.478 with, and without, the ChIP results, respectively), underscoring that the use of the TF binding prior greatly improves the GRN inference. However, NetREX-CF still outperforms all the competing methods even without the information, although the difference with the second-best, GENIE3, becomes smaller (ARS 0.478 vs. 0.497 for the gene prediction). This finding shows that NetREX-CF still functions and outperforms alternative approaches in the absence of ChIP-Seq or ChIP-chip data. However, we advise employing the TF binding prior wherever possible to get the best prediction with NetREX-CF.

Reconstruction of cell-type-specific GRNs from human single-cell RNA-Seq studies

ScRNA-Seq technology is rapidly adopted by biologists for uncovering cell-type-specific transcription programs. To demonstrate that NetREX-CF can generate cell-type-specific GRNs using scRNA-Seq results, we benchmarked all competing methods using previously published human studies. To build GRN for human hepatocyte-like cells (hHep GRN), we collected gene expression profiles from a scRNA-seq analysis of induced pluripotent stem cells (iPSCs) in two-dimensional culture differentiating to hepatocyte-like cells31. We utilized the following datasets as priors: (i) functional prior GRN from STRING72,73, and (ii) non-specific prior GRN from DoROthEA, RegNetwork, and TRRUST73,74,75,76. To evaluate the inferred hHEP GRN, we extracted all the cell-type-specific GRNs from ENCODE, ChIP-Atlas, and ESCAPE databases for ChIP-Seq data from the same or similar cell type73,77,78,79. The summary of the single-cell RNA-Seq, as well as the prior and benchmark GRNs, are elaborated in Tables 3 and 4, respectively.

Table 3 Statistics for the scRNA-Seq data for hHEP and hSCE.
Table 4 Statistics of the prior and benchmark networks for hHEP and hESC.

We compared our NetREX-CF with the prior-based methods (NetREX, MerlinP, and LassoStARS), which can incorporate the prior information to build the hHEP GRN as well as the expression-based methods (GRNBoost2 and GENIE3), which rely on only the scRNA-Seq data to build the hHEP GRN. These two methods were the best-performing methods for predicting directed networks using scRNA-Seq data in a recent evaluation, and therefore, provide a good baseline for the comparison4. NetREX-CF achieves the lowest overall ARS by a large margin (Fig. 5a, ARS = 0.349, compared to 0.5 from the random method), indicating that the rankings of the TFs predicted by NetREX-CF are much lower than the competing methods from the benchmark. In the case of their target gene prediction, we found that all methods have comparable degrees of performance [Fig. 5b, c, ARS 0.486–0.574 with NetREX-CF performing the best]. We think that this is because the TFs in the benchmark dataset control thousands of genes, resulting in an ARS of about 0.5 (see the discussion section).

Fig. 5: Performance comparison for all competing algorithms on building cell-specific GRNs from scRNA-Seq results.
figure 5

a The performance of the methods on the task of predicting regulating TFs in hHep GRN. b The performance of the methods on the task of predicting target genes in hHep GRN. c Comparison of the overall ARS in (a) and (b). The lowest scores are highlighted in bold red. d The performance of the methods on the task of predicting regulating TFs in hESC GRN. e The performance of the methods on the task of predicting target genes in hESC GRN. f Comparison of the overall ARS in (d) and (e).

As the second example, we built the human embryonic stem cell GRN (hESC GRN) by using scRNA-Seq results of the differentiation protocol to produce definitive endoderm cells from human embryonic stem cells32. Analogous to what we performed for building hHEP GRN, we utilized a functional prior network based on the STRING DB, non-specific prior networks from other databases, and cell-type-specific GRN from ChIP-Seq datasets (Tables 3 and 4 for the summary information). Consistent with our observation in the hHEP GRN, NetREX-CF demonstrated superior performance to the other methods in recovering the rank of TFs in the benchmark dataset (Fig. 5d, ARS = 0.366 compared to 0.500 from the second-best method, GENIE3). However, in agreement with the hHEP GRN results, recovering the rank of target genes in the benchmark dataset displayed less distinguishable performances for all the methods (Fig. 5e, f, ARS 0.489 (best, NetREX-CF) to 0.575 (worst, PriorSum)).

Abdullah Anaman
Abdullah Anaman
I am a highly competent IT professional with a proven track record in designing websites, building apps etc. I have strong technical skills as well as excellent interpersonal skills, enabling me to interact with a wide range of clients.


Please enter your comment!
Please enter your name here
Captcha verification failed!
CAPTCHA user score failed. Please contact us!

Most Popular

Recent Comments