This is a brief introduction to
the dnapath
package. This package integrates known pathway
information into the differential network analysis of gene expression
data. It allows for any network inference method to be used, and
provides wrapper functions for several popular methods; these all start
with run_
for ease of searching. Various helper function
such as summary()
and plot()
are implemented
for summarizing and visualizing the differential network results. This
package is a companion to the paper:
Grimes, T., Potter, S. S., & Datta, S. (2019). Integrating gene regulatory pathways into differential network analysis of gene expression data. Scientific reports, 9(1), 1-12.
The package contains two datasets, meso
and
cancer_pathways
. The “Package data” vignette shows how
these two datasets were created, and it illustrates the use of various
methods in the dnapath
package.
The meso
data contains gene expression data for two
groups (stage ii and stage iv) of Mesothelioma tumors.
The cancer_pathways
data is a list of cancer-related
(P53) gene pathways from the Reactome database.
These are the two primary inputs to the dnapath()
method
– a gene expression dataset and a list of pathways – and are used to
demonstrate the package usage in this vignette.
Note: the
dnapath
package is not limited to cancer data. The analysis demonstrated in this introduction can be applied to any gene expression datasets from any species.
The main function of the package is dnapath()
, which
performs the differential network analysis on a gene expression dataset
over a user-specified list of pathways. The output of this function is a
dnapath_list
object (or dnapath
object if only
one pathway is considered). An overview of its
arguments is shown here:
dnapath()
: The main function for running the
differential network analysis.
x
is a single matrix,
groups
must be specified to label each row.
groups
is a vector of length equal to the number of rows in
x
, and it should contain two unique elements (the two group
names).run_
. The default is run_pcor
which
computes partial correlations.n_perm == 1
, the
permutation tests are not performed. If n_perm
is larger
than the number of possible permutations, n_perm
will be
set to this value with a warning message.set.seed
prior to permutation test for each pathway. This allows results for
individual pathways to be easily reproduced.An overview of the other methods available in the
dnapath
package is given in the following sections.
The dnapath_list
and dnapath
objects
obtained from running dnapath()
can be summarized and
visualized using various methdos.
filter_pathways()
: Removes any pathways from the
results that were not significantly differentially connected (based on
the permutation test p-values for the pathway-level
connectivity).
subset()
: Subset the results based on pathways or
genes of interest.
dnapath_list
object.sort()
: Sorts the pathways in the results.
dnapath_list
object.by = "mean_expr"
, the mean expression of
each pathway across both groups; by = "mean_expr1"
or
by = "mean_expr2"
, the mean expression of each pathway in
group 1 or 2, respectively; by = "dc_score"
, the
differential connectivity score of the pathway;
by = "p_value"
, the p-value of the dc score;
by = "n_genes"
, the number of genes in each pathway; or
by = "n_dc"
the number of significantly differentially
conncted genes in each pathway.summary()
: Creates a summary table for a
dnapath_list
or dnapath
object.
plot()
: Plot the differential network from a
dnapath
object.
plot_pair()
: Plot the expression values of two
genes to visualize their marginal association.
The user may specify any list of gene sets to use in the analysis.
However, for convenience dnapath
provides a function for
obtaining a list of Reactome pathways for a given species. This list may
be useful for most analyses.
get_reactome_pathways()
: Connects to the Reactome
database (using the reactome.db
package) to obtain a list
of pathways for a given species.
min_size
genes are omitted from
the list.max_size
genes are removed from the
list.The Reactome pathways are based on entrezgene IDs, which are a unique
identifier for individual genes. It is recommended to also annotate
RNA-seq data using entrezgene IDs when obtaining gene expression counts.
However, when summarizing and visualizing the differential network
analysis results, it may be preferred to use gene symbols. The functions
in this section are used to rename the genes in a
dnapath_list
or dnapath
object, or a pathway
list.
rename_genes()
: Used to rename the genes contained
in the differential network analysis results. This is useful if the user
wants to, for example, create tables or plots summarizing the data using
gene symbols rather than Entrezgene IDs. Note: this function can also be
used to rename the genes in a pathway list.
dnapath_list
or
dnapath
object, or a pathway list.to = "symbol"
will rename entrezgene IDs to gene symbols; this will automatically call
the entrez_to_symbol
method (defined below) to obtain the
gene_mat matrix. The species arugment
must also be specified when to is used.entrez_to_symbol()
.entrez_to_symbol()
: Maps entrezgene IDs to gene
symbols using the biomaRt
package. The Reactome pathways
use entrezgene IDs, and many RNA-seq pipelines will use entrezgene IDs
for annotation. However, gene symbols are more recognizable and may be
preferred for summarizing and visualizing results. This method maps
entrezgene IDs to gene symbols, and the resulting two column matrix can
be used with the rename_genes()
method to rename entrezgene
IDs as gene symbols.
symbol_to_entrez()
: Maps gene symbols to entrezgene
IDs using the biomaRt
package. This can be useful for
renaming gene expression datasets that contain gene symbols. Note,
however, that gene symbols can be ambiguous and may map to multiple
entrezgene IDs. It is recommended to use entrezgene IDs throughout the
analysis and only rename with gene symbols when summarizing the
results.
get_genes()
: This method extracts the genes in a
dnapath_list
or dnapath
object, or from a
pathway list. This is not meant to summarize the results of the
differential network analysis, but rather to help when renaming the
genes in the results; the output is a vector of gene names that can be
used in entrez_to_symbol
or
symbol_to_entrez
.
There are also several network inference methods provided in this
package. Many of these are available in other R packages, but the format
of their inputs and outputs can vary. Wrapper functions have been
created to standardize these aspects. The method names are all prefixed
with run_
(for easy searching) and can be used as the
network_inference argument in dnapath()
.
The methods are listed here, and additional information/references can
be found in the package documentation.
run_aracne
, run_bc3net
,
run_c3net
, run_corr
, `run_genie3
,
run_glasso
, run_mrnet
, run_pcor
,
and run_silencer
.
The default method used is run_pcor
, which infers the
gene-gene association network using partial correlations. It is
important to note that different measures may be better suited for
answering different research questions, however those details are beyond
the scope of this vignette.
Note: For more information on any method, use the
help()
or?
command in R to open the documentation (for examplehelp(run_aracne)
or?run_pcor
).
This example will analyze the meso
data provided in the
package. (The “Package data” vignette shows how these data were
obtained.) First we load the data and take a look at it.
## List of 2
## $ gene_expression: num [1:32, 1:150] 10.18 10.01 7.88 10.2 9.31 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "TCGA.3H.AB3S" "TCGA.3U.A98E" "TCGA.3U.A98G" "TCGA.SC.AA5Z" ...
## .. ..$ : chr [1:150] "84883" "207" "208" "10000" ...
## $ groups : chr [1:32] "stageii" "stageiv" "stageiv" "stageiv" ...
The data is a list containing (1) a gene expression data set of 32 samples and 150 genes and (2) a vector indicating which group each sample belongs to – stageii (group 1) or stage iv (group 2).
The differential network analysis compares the gene-gene association
network between the two groups (stage ii vs stage iv). The
dnapath
package enables the use of known pathway
information into the analysis, but a pathway list is not required. The
dnapath()
method can be run on the full gene expression
dataset, without using pathways:
# Run dnapath using the gene expression and group information from meso dataset.
results <- dnapath(meso$gene_expression,
pathway_list = NULL,
group_labels = meso$groups)
results
## Differential network analysis between stageii (group 1) and stageiv (group 2).
## Results for an unnamed pathway (out of 1 pathways analyzed) containing 150 genes.
## Pathway p-value = 0.475. The mean expression of genes in the pathway is 9.3 in group 1 and 9.4 in group 2.
## # A tibble: 150 × 6
## pathway genes dc_score p_value mean_expr1 mean_expr2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 unnamed 246175 0.116 0.00990 9.34 9.34
## 2 unnamed 2475 0.165 0.0297 10.2 10.3
## 3 unnamed 1616 0.139 0.0297 10.9 10.7
## 4 unnamed 8793 0.149 0.0396 6.90 7.02
## 5 unnamed 9099 0.122 0.0396 8.91 9.29
## 6 unnamed 94241 0.131 0.0495 10.1 10.1
## 7 unnamed 5933 0.0466 0.0495 5.95 6.30
## 8 unnamed 29117 0.151 0.0594 9.95 10.0
## 9 unnamed 91875 0.0940 0.0594 7.24 7.43
## 10 unnamed 8797 0.0884 0.0594 6.43 6.27
## # ℹ 140 more rows
Printing the results provides a summary of the differential network analysis. The output shows that “stageii” and “stageiv” groups were compared. It says a single “unnamed pathway” was analyzed containing 150 genes; this is referring to the fact that all genes in the dataset were analyzed together and no pathway list was provided. The mean expression level of all genes in each group are also shown (these are 9.3 and 9.4, respectively, for this data), and a p-value (0.723) of the overall differential connectivity.
The output shows the differential network analysis results at the
gene level. This table can be obtained using
summary(results)
, which returns a tibble
that
can be filtered and sorted using base R
functions or
methods from the dplyr
package. The columns in this table
include: genes, the genes analyzed in this pathway;
dc_score, the differentially connectivity (DC) score
estimated for this gene; p_value, the permutation
p-value for the DC score (under the null hypothesis of no differential
connectivity); p_value_m, monotonized p-values, which
are a conservative estimate; mean_expr1, the mean
expression of each gene in group 1; mean_expr2, the
mean expression of each gene in group 2.
The plot
function can be used to view the differential
network. In this case, since all genes were analyzed together, the
differential network will show all genes present in the gene expression
dataset.
The
alpha = 0.05
argument will remove any edges whose differential connectivity score p-value is above 0.05.
With 150 genes, the visualization of the differential network is not very useful; it is difficult to extract much information from this plot. One of the benefits of using known pathway information is that it breaks down the analysis into smaller gene sets, and the resulting differential networks are more manageable and informative. These advantages are illustrated in the next section.
The main feature of dnapath
is that it incorporates
known gene pathways into the differential network analysis. A pathway is
nothing more than a set of genes that is thought to belong to a
particular biological process. The get_reactome_pathways
method can be used to obtain a list of pathways from the Reactome
database.
In this section, we will use the p53_pathways
data that
accompanies the package. TP53 is a oncogene that is often mutated in
many cancers, and the p53_pathways
list contains 13
pathways related to P53 signaling.
We refer the reader to the “Package data” vignette for details on how the
p53_pathways
list was created. That vignette illustrates how to useget_reactome_pathways()
.
The difference compared to the previous section is that the
pathway_list
argument is now specified when running
dnapath()
. Additionally, we will use the seed
argument (which is optional) so that each pathway result can be easily
reproduced individually.
data(meso) # Load the gene expression data
data(p53_pathways)
# Run the differential network analysis.
results <- dnapath(x = meso$gene_expression,
pathway_list = p53_pathways,
group_labels = meso$groups,
seed = 0)
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Warning: 1 instances of variables with zero scale detected!
## Differential network analysis results between stageii (group 1) and stageiv (group 2) over 13 out of 13 pathways analyzed.
## # A tibble: 13 × 7
## pathway dc_score p_value n_genes `n_dc (0.1)` mean_expr1 mean_expr2
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 Regulation of TP… 1.77 0.158 37 8 10.1 10.2
## 2 Regulation of TP… 1.15 0.931 30 0 10.2 10.2
## 3 Regulation of TP… 0.880 0.158 14 5 7.70 7.60
## 4 TP53 Regulates T… 2.35 0.644 49 2 9.09 9.16
## 5 TP53 Regulates T… 1.30 0.941 14 0 7.76 7.94
## 6 TP53 Regulates T… 2.03 0.238 44 6 8.68 8.73
## 7 TP53 regulates t… 1.05 0.139 14 5 8.52 8.46
## 8 Regulation of TP… 1.28 0.0990 19 3 10.8 10.8
## 9 TP53 Regulates T… 1.55 0.653 18 0 9.37 9.35
## 10 TP53 Regulates T… 1.32 0.535 20 0 8.59 8.61
## 11 TP53 regulates t… 0.954 0.792 21 0 9.88 9.92
## 12 TP53 Regulates T… 0.906 0.822 12 0 8.15 8.16
## 13 TP53 Regulates T… 0.998 0.0792 12 3 7.91 7.91
The results can be filtered using the filter_pathways
method to remove any pathways with differential connectivity p-values
above some threshold. This is demonstrated in the following code, which
uses a threshold of 0.1.
## Differential network analysis results between stageii (group 1) and stageiv (group 2) over 2 out of 13 pathways analyzed.
## # A tibble: 2 × 7
## pathway dc_score p_value n_genes `n_dc (0.1)` mean_expr1 mean_expr2
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 Regulation of TP5… 1.28 0.0990 19 3 10.8 10.8
## 2 TP53 Regulates Tr… 0.998 0.0792 12 3 7.91 7.91
The pathways can also be easily sorted based on different values. For example, the follow code sorts the pathways so that the one with the highest number of differentially connected genes is listed first. Note: this method of ordering will tend to favor larger pathways, since there are more changes for that pathway to contain differentially connected genes just by chance.
## Differential network analysis results between stageii (group 1) and stageiv (group 2) over 2 out of 13 pathways analyzed.
## # A tibble: 2 × 7
## pathway dc_score p_value n_genes `n_dc (0.1)` mean_expr1 mean_expr2
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 Regulation of TP5… 1.28 0.0990 19 3 10.8 10.8
## 2 TP53 Regulates Tr… 0.998 0.0792 12 3 7.91 7.91
Pathways can be plotted one at a time by indexing the results list. The following code shows how to print out the first pathway.
Note: The layout of the nodes is random when generating plots. For reproducible plots, the random number generator seed needs to be set prior to plotting.
# The plot layout is stochastic. Setting the RNG seed allows for reproducible plots.
set.seed(0)
plot(results[[1]], alpha = 0.05, only_dc = TRUE)
This plot uses entrezgene IDs, which are the identifiers used in the
gene expression and pathway list data. At this point in the analysis it
can be convenient to translate these IDs to the more recognizable gene
symbols. The rename_genes
method can be used to achieve
this.
Note: Internet connection is required to connect to biomaRt, which
rename_genes
uses to map entrezgene IDs to gene symbols. Thedir_save
argument can be set when runningrename_genes()
which will save the ID mapping obtained from biomaRt in the specified directory. This way, the mapping can be obtained once (using the internet connection) and accessed from memory in all future calls ofrename_genes()
. A temporary directory is used in this example.
## Connecting to biomaRt database...
## - saving gene info to /tmp/RtmpMlpVEH/entrez_to_hsapiens.rds
## Differential network analysis between stageii (group 1) and stageiv (group 2).
## Results for the pathway "Regulation of TP53 Activity through Methylation" (out of 13 pathways analyzed) containing 16 genes.
## Pathway p-value = 0.099. The mean expression of genes in the pathway is 10.8 in group 1 and 10.8 in group 2.
## # A tibble: 16 × 6
## pathway genes dc_score p_value mean_expr1 mean_expr2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Regulation of TP53 Activity thr… TTC5 0.339 0.0297 7.24 7.43
## 2 Regulation of TP53 Activity thr… EHMT2 0.354 0.0396 11.0 10.9
## 3 Regulation of TP53 Activity thr… MDM4 0.223 0.0990 9.48 9.43
## 4 Regulation of TP53 Activity thr… ATM 0.215 0.139 10.2 10.4
## 5 Regulation of TP53 Activity thr… EHMT1 0.206 0.158 10.1 10.1
## 6 Regulation of TP53 Activity thr… UBA52 0.155 0.178 13.7 13.4
## 7 Regulation of TP53 Activity thr… CHEK2 0.212 0.208 7.76 7.72
## 8 Regulation of TP53 Activity thr… RPS2… 0.193 0.208 13.6 13.3
## 9 Regulation of TP53 Activity thr… EP300 0.136 0.267 10.0 10.2
## 10 Regulation of TP53 Activity thr… TP53 0.229 0.287 11.1 10.8
## 11 Regulation of TP53 Activity thr… SMYD2 0.186 0.287 9.93 9.86
## 12 Regulation of TP53 Activity thr… MDM2 0.211 0.327 11.0 11.1
## 13 Regulation of TP53 Activity thr… UBC 0.191 0.426 15.4 15.4
## 14 Regulation of TP53 Activity thr… JMY 0.168 0.436 8.11 8.49
## 15 Regulation of TP53 Activity thr… PRMT5 0.137 0.505 10.4 10.4
## 16 Regulation of TP53 Activity thr… UBB 0.139 0.535 14.2 14.1
set.seed(0) # Reset seed to use same layout as previous plot.
plot(results[[1]], alpha = 0.05, only_dc = TRUE)
The differential network plot shows a lot of information about the two groups being compared. First, the nodes are scaled based on the gene’s average expression across both groups. If the gene is differentially expressed, then the node will be shaded blue or red depending on whether the gene has higher expression in group 1 or group 2, respectively. For example, in the plot above we see that the node for POU4F1 is red, indicating a higher expression in group 2. However, the size of this node is very small, indicating that it’s mean expression is small across both groups. So, while it may have a high fold-change, its overall expression level is relatively low. This can be verified in the table above – the second row shows that POU4F1 has mean expression of 0.476 in group 1 and 0.811 in group 2, which is an almost two-fold change in expression, but is overall a very low level relative to the expression levels of other genes in this pathway. Notice that none of the genes with a large nodes have any color, which indicates there is no substantial difference in expression levels for these between the two groups.
The edges in the network are similarly colored: a blue edge indicates the gene-gene association is stronger in group 1, and a red edge indicates the association is stronger in group 2. The thickness of the edges are scaled according to the magnitude of the association, and the opacity is scaled based on the p-value of the edge’s differential connectivity score. That is, a thick, opaque (i.e. non-transparent) edge indicates a strong association with high statistical evidence of differential connectivity. But a thin, transparent edge indicates a relatively weaker association with less evidence of differential connectivity (high p-value). For example, the red edge between TP53 and BAND is thick – indicating that the association is strong (in at least one of the two groups) compared to other connections – and the edge is dark red – indicating that the association is stronger in group 2. In contrast, the edge between BAND and AKT3 is thin – indicating that the magnitude of association between these genes is relatively small compared to other associations – and the edge is light blue – indicating the association is stronger in group 1, but the statistical evidence of this association is relatively low.
The plot is useful for a quick overview of the findings; a detailed
report of the differential connections can be obtained using the
summarize_edges()
function.
## # A tibble: 11 × 6
## pathway edges dc_score p_value nw1 nw2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Regulation of TP53 Activity through… ATM … 0.0303 0.0198 -0.0660 -2.40e-1
## 2 Regulation of TP53 Activity through… EHMT… 0.0548 0.0198 -0.0732 1.61e-1
## 3 Regulation of TP53 Activity through… EHMT… 0.0596 0.0198 -0.0289 2.15e-1
## 4 Regulation of TP53 Activity through… EHMT… 0.0711 0.0297 -0.166 1.01e-1
## 5 Regulation of TP53 Activity through… EHMT… 0.0305 0.0495 0.00952 1.84e-1
## 6 Regulation of TP53 Activity through… EP30… 0.0215 0.0297 0.0431 1.90e-1
## 7 Regulation of TP53 Activity through… MDM4… 0.0424 0.0297 -0.0670 1.39e-1
## 8 Regulation of TP53 Activity through… RPS2… 0.0461 0.0297 -0.0182 1.97e-1
## 9 Regulation of TP53 Activity through… TP53… 0.0533 0.0495 -0.0182 2.13e-1
## 10 Regulation of TP53 Activity through… TTC5… 0.0750 0.0396 -0.107 1.67e-1
## 11 Regulation of TP53 Activity through… UBA5… 0.0218 0.0297 0.148 5.42e-4
In row 6, we see the summary statistics for the BANP-TP53 edge. The values agree with our interpretation from the plot: there is an estimated association of -0.02 in group 1 and an estimate of 0.22 in group 2. This suggests that there is some difference in the perturbation in the genomic environment between these two groups that is either (1) causing an inactivation of a BANP-TP53 association in the stage II patients (group 1) or (2) an activation of this association in the stage IV patients (group 2). An external literature review may suggest that BANP promotes TP53 activation, which causes cell cycle arrest. If true, then the results of this analysis would suggest that this regulatory behavior is inactive in the stage II patients.
Note: The output of the
summarize_edges()
function (as well as other summary
functions) is a tibble
object, so it can be sorted,
filtered, and otherwise manipulated using functions from the
dplyr
package. The follow code shows an example where the
results are arranged by p_values and further filtered to remove any
edges that have an association magnitude lower than 0.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tab <- summarize_edges(results[[1]])
tab <- dplyr::arrange(tab, p_value, decreasing = FALSE)
tab <- dplyr::filter(tab, pmax(abs(nw1), abs(nw2)) > 0.2)
tab
## # A tibble: 5 × 6
## pathway edges dc_score p_value nw1 nw2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Regulation of TP53 Activity through Met… ATM … 0.0303 0.0198 -0.0660 -0.240
## 2 Regulation of TP53 Activity through Met… EHMT… 0.0596 0.0198 -0.0289 0.215
## 3 Regulation of TP53 Activity through Met… TP53… 0.0533 0.0495 -0.0182 0.213
## 4 Regulation of TP53 Activity through Met… MDM2… 0.0574 0.0594 -0.0267 0.213
## 5 Regulation of TP53 Activity through Met… JMY … 0.0282 0.0990 0.0393 0.207
Each edge in the differential network corresponds to a pair of genes.
To further investigate the association of a pair of genes, and to see
how their association differs between the two groups, the
plot_pair()
function can be used to create a scatterplot of
the gene expression data. We demonstrate this function by visualizing
BANP and TP53 genes.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 32 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
By default, this function will fit a loess curve for each group. Alternatively, a linear model can be used.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 32 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).
These plots show the marginal association between two genes – it does
not take into account any the expression level of other observed genes.
When we conducted the differential network analysis using
dnapath()
earlier in this example, the
run_pcor()
function was used by default as our measure of
association. This is estimates the partial correlation, which is a
conditional association measure that takes into account the other genes
in the pathway. It is important to remember that the marginal plot
created by plot_pair()
does not show all the information
that the differential network analysis uses, however it may nonetheless
be a useful visualization to check the results obtained from the
analysis.
There is an option when creating visualizations or generating summary
tables of the differentially connected edges to require that at least
one of the two genes for a given edge is also significantly
differentially connected (as tested at the gene level). Enforcing this
requirement can help reduce the false discovery rate (at the potential
cost of reduced sensitivity). Setting the require_dc_genes
argument to TRUE
in the plot()
function and
summarize_edges()
function will enforce this requirement,
as shown in the code below. Note how there fewer differentiallly
connected edges shown here compared to the previous section.
set.seed(0) # Reset seed to use same layout as previous plot.
plot(results[[1]], alpha = 0.05, only_dc = TRUE, require_dc_genes = TRUE)
## # A tibble: 5 × 6
## pathway edges dc_score p_value nw1 nw2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Regulation of TP53 Activity through Me… ATM … 0.0303 0.0198 -0.0660 -0.240
## 2 Regulation of TP53 Activity through Me… EHMT… 0.0711 0.0297 -0.166 0.101
## 3 Regulation of TP53 Activity through Me… EHMT… 0.0305 0.0495 0.00952 0.184
## 4 Regulation of TP53 Activity through Me… EP30… 0.0215 0.0297 0.0431 0.190
## 5 Regulation of TP53 Activity through Me… RPS2… 0.0461 0.0297 -0.0182 0.197
The default network inference method is run_pcor
, which
uses partial correlations to estimate the gene association network in
the two groups. This is changed by setting the
network_inference
argument in dnapath()
. For
example:
# Run the differntial network analysis using ARACNE.
results <- dnapath(x = meso$gene_expression,
pathway_list = p53_pathways,
groups = meso$groups,
network_inference = run_aracne)
Note: This code is not run here because it requires
the minet
package that is available on Bioconductor.
All of the summary and plotting functions behave the same as before
regardless of the network inference method used. Note however, if the
chosen method produces a dense network (such as run_corr
),
then the visualization may become more cluttered. In addition, some
methods are more computationally intensive (such as
run_genie3
), and the differential network analysis will
become more time consuming. To help counter this, the number of
permutations performed for the significance testing can be lowered (the
default is n_perm = 100
). For example:
# Run the differntial network analysis using GENIE3 with 20 permutations.
results <- dnapath(x = meso$gene_expression,
pathway_list = p53_pathways,
group_labels = meso$groups,
network_inference = run_genie3,
n_perm = 20)
Alternatively, there may be tuning parameters in the network
inference method that can speed up computation. For example, in
run_genie3
, the nTrees
argument might be
lowered to trade-off performance for computation time. This parameter
can be adjusted using the additional ...
argument of
dnapath
.
In this brief introduction to the dnapath
package, we
have reviewed the basic steps for conducting the differential network
analysis and for summarizing the results with tables and plots. The
examples here used the Mesothelioma dataset that accompanies the
package, but we reiterate that the package is not limited to analyzing
cancer datasets.