Co-expression Network

A gene co-expression network (GCN) is an undirected graph in which each node corresponds to a gene and each edge represents a co-expression relationship between a pair of genes. Many of these relationships provide strong evidence for the involvement of new genes in core biological functions such as the cell cycle, secretion, and protein expression [1].

To measure the co-expression relationships between genes, that is, to construct a gene co-expression network, four commonly used measures are the Pearson correlation coefficient [2], mutual information, the Spearman rank correlation coefficient, and Euclidean distance. Here, I give a simple example with R code to illustrate how Pearson correlation is used to infer gene interactions.

The Pearson correlation coefficient is a popular measure of the linear correlation between two variables X and Y. Its value lies between +1 and -1, where +1 indicates a perfect positive linear correlation, 0 indicates no linear correlation, and -1 indicates a perfect negative linear correlation. (Note that the Pearson correlation coefficient is sensitive to outliers.)
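To make the remark about outliers concrete, here is a minimal sketch in R. The data are made up purely for illustration: ten points on an exact straight line, plus one extreme point added afterwards. Spearman's rank correlation is computed only as a point of comparison.

```r
# Ten points on an exact straight line, so Pearson's r is 1
x <- 1:10
y <- 2 * x + 1
r_clean <- cor(x, y)

# Add a single extreme point that breaks the linear pattern
x_out <- c(x, 11)
y_out <- c(y, -100)
r_outlier <- cor(x_out, y_out)  # one point is enough to make r negative

# Spearman's rank correlation on the same data, for comparison
rho_outlier <- cor(x_out, y_out, method = "spearman")  # stays positive (0.5)

print(c(r_clean = r_clean, r_outlier = r_outlier, rho_outlier = rho_outlier))
```

A single aberrant measurement flips the sign of Pearson's coefficient here, while the rank-based coefficient only drops to 0.5. For real expression data it is worth checking for outliers before trusting an edge.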

For a population

\(\rho\) is commonly used to represent the coefficient:
$$\rho_{X,Y}=\frac{cov(X,Y)}{\sigma_X \sigma_Y}$$

where \( cov \) is the covariance, and \( \sigma_X \) and \( \sigma_Y \) are the standard deviations of X and Y, respectively. Alternatively, the formula for \(\rho\) can be written as:
$$\rho_{X,Y}=\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$
where \(\sigma_X\) and \(\sigma_Y\) are defined as above, \(\mu_X\) and \(\mu_Y\) are the means of \(X\) and \(Y\), and \(E\) is the expectation.

For a sample

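Pearson's correlation coefficient, when applied to a sample, is commonly represented by the letter \(r\). Given two samples, each containing \(n\) values, the formula for \(r\) is:
$$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where \(n\) is the sample size, \(x_i, y_i\) are the individual sample points indexed with \(i\), and \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}\) is the sample mean; and analogously for \(\bar{y}\).

Rearranging gives us this formula for \(r\):
$$r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

where \({\frac{x_{i}-{\bar{x}}}{s_{x}}}\) is the standard score and \(s_x\) is the sample standard deviation of \(x\) (and analogously for the standard score of \(y\)).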
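As a quick sanity check, the two sample formulas can be evaluated by hand in R and compared with the built-in cor(). This is only a sketch: the seed and the variable names r_def, r_z, zx and zy are my own choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- runif(10)
y <- runif(10)
n <- length(x)

# Definition: sum of cross-products of deviations, divided by the
# product of the root sums of squared deviations
r_def <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# Standard-score form: sd() uses the n - 1 denominator, which matches
# the 1 / (n - 1) factor in front of the sum
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)
r_z <- sum(zx * zy) / (n - 1)

# Both should agree with cor() up to floating-point error
print(all.equal(r_def, cor(x, y)))
print(all.equal(r_z, cor(x, y)))
```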

Let's go over an example using R

# Generate random values
X <- runif(10, min = 0, max = 1)
Y <- runif(10, min = 0, max = 1)
# Compute the Pearson correlation coefficient
p <- cor(X, Y, method = "pearson")

# Output below (the value changes from run to run because no seed is set)
> p
[1] 0.3563

Significance of the Pearson correlation coefficient

By applying the Pearson correlation coefficient to every pair of distinct genes, we get a complete graph in which every edge carries its own coefficient. The next important question is which edges represent significant co-expression relationships. That is, after obtaining the Pearson correlation coefficient for every edge, we have to determine a significance threshold. Note that we consider only statistical significance here and ignore biological insight for now; a statistically significant relationship may still be biologically irrelevant.
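The all-pairs step can be sketched in R as follows. Everything here is an assumption made for illustration: the expression matrix is simulated, the gene names are placeholders, and the cutoff of 0.8 is an arbitrary value, not a recommended threshold.

```r
set.seed(42)  # arbitrary seed for reproducibility
n_genes <- 5
n_samples <- 20

# Hypothetical expression matrix: rows are genes, columns are samples
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), NULL))
# Make gene2 track gene1 so that at least one pair is strongly co-expressed
expr["gene2", ] <- expr["gene1", ] + rnorm(n_samples, sd = 0.1)

# cor() correlates columns, so transpose to get gene-by-gene coefficients
cor_mat <- cor(t(expr), method = "pearson")

# Keep edges whose |r| exceeds the illustrative cutoff; drop self-loops
threshold <- 0.8
adj <- abs(cor_mat) > threshold
diag(adj) <- FALSE

# The graph is undirected, so list each edge once (upper triangle only)
edges <- which(adj & upper.tri(adj), arr.ind = TRUE)
edge_list <- data.frame(
  gene_a = rownames(cor_mat)[edges[, "row"]],
  gene_b = colnames(cor_mat)[edges[, "col"]],
  r = cor_mat[edges]
)
print(edge_list)
```

The matrix cor_mat is the weighted complete graph described above, and adj is the network that remains after thresholding.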

One common way to choose the threshold is a permutation test, sketched here in R:

x <- runif(20, 1, 5)
y <- runif(20, 1, 5)
corr_init <- cor(x, y)  # observed correlation
N <- 1000               # number of permutations
count <- 0              # permutations whose correlation exceeds corr_init
for (i in 1:N) {
    y_perm <- sample(y)  # base R: random permutation of y
    if (cor(y_perm, x) > corr_init) {
        count <- count + 1
    }
}
p <- count / N  # one-sided empirical p-value
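For comparison with the permutation approach, base R's cor.test() returns a parametric p-value for the same one-sided question. This is a sketch rather than a recommendation: the seed is arbitrary, and the test relies on distributional assumptions (roughly, bivariate normal data) that the permutation test avoids.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible
x <- runif(20, 1, 5)
y <- runif(20, 1, 5)

# One-sided alternative (true correlation > 0), matching the
# "greater than" comparison in the permutation loop
res <- cor.test(x, y, method = "pearson", alternative = "greater")

print(res$estimate)  # sample correlation, the same value as cor(x, y)
print(res$p.value)   # p-value from a t distribution with n - 2 degrees of freedom
```

When both p-values are available for the same pair, a large disagreement between them is a hint that the parametric assumptions may not hold for the data.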


[1] Stuart, Joshua M., et al. "A gene-coexpression network for global discovery of conserved genetic modules." Science 302.5643 (2003): 249-255.
[5] Kowalski, Charles J. "On the effects of non-normality on the distribution of the sample product-moment correlation coefficient." Applied Statistics (1972): 1-12.