## Co-expression Network

A gene co-expression network (GCN) is an undirected graph in which each node corresponds to a gene and each edge represents a co-expression relationship between a pair of genes. Many of these relationships provide strong evidence for the involvement of new genes in core biological functions such as the cell cycle, secretion, and protein expression [1].

To measure the co-expression relationships between genes, i.e. to construct a gene co-expression network, four commonly used approaches are the Pearson correlation coefficient [2], mutual information, the Spearman rank correlation coefficient, and Euclidean distance. Here, I give a simple example with R code to illustrate how the Pearson correlation is used to infer gene interactions.

The Pearson correlation coefficient is a popular measure of the linear correlation between two variables X and Y. Its value lies between +1 and -1, where +1 indicates a perfect positive linear correlation, 0 indicates no linear correlation, and -1 indicates a perfect negative linear correlation. (Note that the Pearson correlation coefficient is not robust to outliers.)
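
To make the remark about outliers concrete, here is a minimal sketch on made-up toy data (the vectors `x`, `y_clean` and `y_outlier` are illustrative assumptions, not gene expression values). It changes a single observation and compares the resulting Pearson coefficient with the rank-based Spearman coefficient.

```r
# Toy data (hypothetical): a perfectly linear relationship
x <- 1:10
y_clean <- 1:10
r_clean <- cor(x, y_clean)            # exactly linear, so r is 1

# Replace the last observation with one extreme value
y_outlier <- y_clean
y_outlier[10] <- -20
r_outlier <- cor(x, y_outlier)        # a single point makes r negative

# Spearman works on ranks, so the same point pulls it less far
rho_outlier <- cor(x, y_outlier, method = "spearman")

print(c(clean = r_clean, outlier = r_outlier, spearman = rho_outlier))
```

In this toy case one changed value out of ten is enough to flip the sign of the Pearson coefficient, while the Spearman coefficient stays positive.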

#### For a population

$$\rho$$ is commonly used to represent the coefficient:
$$\rho_{X,Y}=\frac{cov(X,Y)}{\sigma_X \sigma_Y}$$

where $$cov$$ is the covariance, and $$\sigma_X$$ and $$\sigma_Y$$ are the standard deviations of X and Y, respectively. Alternatively, the formula for $$\rho$$ can be written as:
$$\rho_{X,Y}=\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$
where $$\sigma_X$$ and $$\sigma_Y$$ are defined as above, $$\mu_X$$ and $$\mu_Y$$ are the means of $$X$$ and $$Y$$, and $$E$$ is the expectation.

#### For a sample

Pearson's correlation coefficient when applied to a sample is commonly represented by the letter $$r$$. Given two samples both containing $$n$$ values, the formula for $$r$$ is:

$$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

where $$n$$ is the sample size, $$x_i$$ and $$y_i$$ are the individual sample values indexed by $$i$$, and $$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$ is the sample mean; and analogously for $$\bar{y}$$.

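Rearranging gives us this formula for $$r$$:

$$r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

where $$s_x$$ is the sample standard deviation of x and $${\frac{x_{i}-{\bar{x}}}{s_{x}}}$$ is the standard score (and analogously for the standard score of y).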

Let's go over an example using R:

    # Generate random values
    set.seed(1)
    X <- runif(10, min = 0, max = 1)
    set.seed(100)
    Y <- runif(10, min = 0, max = 1)
    # Compute the Pearson correlation coefficient
    p <- cor(X, Y, method = "pearson")

    # Output below
    > p
    [1] 0.3563
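
As a sanity check, the sample formulas can also be evaluated by hand and compared with `cor()`. The sketch below regenerates the same `X` and `Y`; the names `r_def`, `r_z` and `r_cov` are illustrative and not part of any library.

```r
# Same data as in the example
set.seed(1)
X <- runif(10, min = 0, max = 1)
set.seed(100)
Y <- runif(10, min = 0, max = 1)
n <- length(X)

# Definition: centred cross-products over the product of root sums of squares
dx <- X - mean(X)
dy <- Y - mean(Y)
r_def <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# Rearranged form: average product of standard scores (sd() divides by n - 1)
r_z <- sum((dx / sd(X)) * (dy / sd(Y))) / (n - 1)

# Covariance form: cov(X, Y) / (s_x * s_y)
r_cov <- cov(X, Y) / (sd(X) * sd(Y))

# All three should agree with the built-in result up to rounding error
print(c(definition = r_def, z_scores = r_z, covariance = r_cov, builtin = cor(X, Y)))
```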


#### Significance of the Pearson correlation coefficient

By applying the Pearson correlation coefficient to every pair of distinct genes, we get a complete graph in which every edge is associated with its own coefficient. The next important question is which edges represent significant co-expression relationships. That is, after obtaining the Pearson correlation coefficient for every edge, we have to determine a significance threshold. Note that we only consider statistical significance here and set biological insight aside for now; a statistically significant edge may still be biologically irrelevant.
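
A minimal sketch of this step is shown below, assuming a toy expression matrix `expr` with genes in rows and samples in columns. The data are simulated, and the cut-off `threshold <- 0.8` is an arbitrary placeholder rather than a recommended value.

```r
# Simulated expression matrix (hypothetical): 5 genes x 12 samples
set.seed(42)
n_genes <- 5
n_samples <- 12
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("sample", 1:n_samples)))
# Make gene2 track gene1 closely so that at least one edge survives
expr["gene2", ] <- expr["gene1", ] + rnorm(n_samples, sd = 0.1)

# cor() correlates columns, so transpose to get a gene-by-gene matrix
r_mat <- cor(t(expr), method = "pearson")

# Keep edges whose absolute correlation reaches the placeholder threshold
threshold <- 0.8
adj <- abs(r_mat) >= threshold
diag(adj) <- FALSE                     # no self-loops

# List each undirected edge once (upper triangle only)
idx <- which(adj & upper.tri(adj), arr.ind = TRUE)
edge_list <- data.frame(gene_a = rownames(adj)[idx[, "row"]],
                        gene_b = colnames(adj)[idx[, "col"]],
                        r = r_mat[idx])
print(edge_list)
```

The matrix `r_mat` corresponds to the complete graph described above, and `adj` is the network that remains after thresholding.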

We introduce some common threshold selection methods here:

• T-test when $$\rho$$ is assumed to be 0. The usual assumption is that there is NO linear relationship between X and Y in the population: $$\rho=0$$. This parametric test of the correlation coefficient is only valid if the assumption of bivariate normality is met [4,5].
  • Null hypothesis $$H_0: \rho=0$$; alternative hypothesis $$H_A: \rho\neq0$$ or $$H_A: \rho>0$$ or $$H_A: \rho<0$$
  • Test statistic: $$t=\frac{r}{\sqrt{\frac{1-r^2}{n-2}}}$$, where the t-distribution has $$n-2$$ degrees of freedom.
  • Check the t-distribution table to find the critical value.
• Test when $$\rho$$ is NOT assumed to be 0
  • When $$\rho$$ is not zero, the sampling distribution of $$r$$ is skewed, so the Fisher $$z$$ transformation should be used. The formula for the transformed value is: $$z_r=\frac{1}{2}\ln\frac{1+r}{1-r}$$
  • The standard deviation of $$z_r$$ is approximately $$\sigma=\frac{1}{\sqrt{n-3}}$$
  • Then calculate the z-score: $$z=\frac{z_r-z_{\rho_0}}{\sigma}$$, where $$z_{\rho_0}$$ is the Fisher transformation of the hypothesized value $$\rho_0$$.
• Permutation test
  • Compute the Pearson correlation coefficient $$r_{obs}$$ for the observed data and its associated t statistic $$t_{obs}$$.
  • Permute the Y values among the X values in all $$n!$$ possible ways.
  • For each permutation, calculate its correlation coefficient $$r$$ and the associated $$t$$ statistic. This gives a set of $$t^*$$ values.
  • Calculate the p-value:
    • For an upper-tail test: $$p=\frac{\text{number of } t^* \geq t_{obs}}{n!}$$
    • For a two-sided test: $$p=\frac{\text{number of } |t^*| \geq |t_{obs}|}{n!}$$
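
Enumerating all $$n!$$ permutations is rarely feasible, so the R code below approximates the upper-tail test with N random permutations. It compares $$r$$ directly, which is equivalent to comparing $$t$$ for an upper-tail test because $$t$$ increases with $$r$$.

    library(gtools)                  # provides permute()
    set.seed(1)                      # make the example reproducible
    x <- runif(20, 1, 5)
    y <- runif(20, 1, 5)
    corr_init <- cor(x, y)           # observed correlation
    N <- 1000                        # number of random permutations
    count <- 0                       # permutations with correlation >= corr_init
    for (i in 1:N) {
      y_perm <- permute(y)           # shuffle y while x stays fixed
      if (cor(y_perm, x) >= corr_init) {
        count <- count + 1
      }
    }
    p <- count / N                   # estimated upper-tail p-value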
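
The two parametric tests in the list above can be sketched in the same way. The example below uses simulated data, and the hypothesized value `rho0 <- 0.3` is an arbitrary assumption chosen only for illustration. It computes the t statistic by hand, compares it with `cor.test()` from base R's stats package, and then applies the Fisher $$z$$ transformation.

```r
# Simulated data (hypothetical) with some built-in positive association
set.seed(1)
x <- runif(20, 1, 5)
y <- 0.5 * x + runif(20, 1, 5)
n <- length(x)
r <- cor(x, y)

# t-test of H0: rho = 0, two-sided, with n - 2 degrees of freedom
t_stat <- r / sqrt((1 - r^2) / (n - 2))
p_t <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# cor.test() performs the same test and should reproduce both numbers
ct <- cor.test(x, y, method = "pearson", alternative = "two.sided")
print(c(t_manual = t_stat, t_cor_test = unname(ct$statistic)))
print(c(p_manual = p_t, p_cor_test = ct$p.value))

# Fisher z test of H0: rho = rho0 (rho0 is an assumed value)
rho0 <- 0.3
z_r <- 0.5 * log((1 + r) / (1 - r))    # same as atanh(r)
z_0 <- 0.5 * log((1 + rho0) / (1 - rho0))
sigma <- 1 / sqrt(n - 3)
z_stat <- (z_r - z_0) / sigma
p_z <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)
print(c(z = z_stat, p = p_z))
```

Both tests rely on the bivariate normality assumption mentioned above, which the uniform toy data here do not satisfy; the sketch only illustrates the arithmetic.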