where Q is a variational distribution, DKL is the Kullback-Leibler divergence, and p(vi) and p(hi | vi) are the prior and posterior, respectively.
The constrained posterior of a bicluster is obtained by multiplying the input matrix by a vector, and then rectifying and normalizing the code unit. To make the feature membership vectors and sample membership vectors sparse, a Laplace prior on the parameters of the original RFN model and a component-wise independent Laplace prior on the weights W are introduced. To obtain the final biclusters of the input matrix, we used the threshold values H_thr and W_thr to filter out genes and samples in each bicluster, as in .
RFN can efficiently extract thousands of biclusters from a very large matrix. We ran RFN iteratively, and in each run only the bicluster with the highest absolute mean Z-score and the smallest p-value was selected, so that a large number of biclusters accumulates over many iterations. As in , we used the p-value of the most enriched biological pathway of a bicluster as the p-value of that bicluster. Specifically, the probability of having x genes of the same function in a bicluster of size n, with a total of N genes, can be computed using the following hypergeometric function:
where p is the percentage of that pathway among all pathways in the whole pathway terms. The p-value is defined in Eq. (4).
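As an illustration of this enrichment calculation, the following minimal Python sketch evaluates a standard hypergeometric probability and its upper tail. It is not the paper's exact Eq. (4): the function names are hypothetical, the number of pathway genes is assumed to be K = round(p * N), and the p-value is assumed to be the probability of observing at least x pathway genes.

```python
from math import comb


def hypergeom_pmf(x, n, N, p):
    """Probability of exactly x pathway genes in a bicluster of n genes
    drawn from N genes in total.

    Assumption: the pathway contains K = round(p * N) of the N genes.
    """
    K = round(p * N)
    # Outside the support of the distribution the probability is zero.
    if x < max(0, n - (N - K)) or x > min(n, K):
        return 0.0
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)


def enrichment_p_value(x, n, N, p):
    """Upper-tail probability of at least x pathway genes (assumed p-value)."""
    K = round(p * N)
    return sum(hypergeom_pmf(k, n, N, p) for k in range(x, min(n, K) + 1))
```

For example, `enrichment_p_value(2, 2, 4, 0.5)` is the probability that both genes of a two-gene bicluster come from a pathway covering half of four genes.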
To obtain breast cancer-specific biclusters, only biclusters detected in breast cancer samples but not in normal samples are kept. Because some genes belong to several functional categories, biclusters extracted from a gene expression matrix may share genes, but their overlap should stay below a predefined threshold. Here, we used an empirical threshold of 0.5, as suggested in Orzechowski et al. . The pseudo code of cancer-specific bicluster detection is given below:
In this method, the input consists of the breast cancer and normal expression matrices, EC and EN. The output is the set of breast cancer-specific biclusters. The parameters are n_hidden (number of latent variables to estimate), n_iter (number of iterations to run the algorithm), learnrateW (learning rate of the W parameter), learnratePsi (learning rate of the Psi parameter), dropout_rate (dropout rate for the latent variables), minP (minimal value for Psi), H_thr (the threshold value used to extract features belonging to a bicluster) and W_thr (the threshold value used to extract samples belonging to a bicluster).
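The filtering steps of this procedure (keeping only cancer-specific biclusters and limiting overlap to 0.5) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' pseudo code: biclusters are represented only by their gene sets, overlap is measured with the Jaccard index, a cancer bicluster counts as "also detected in normal samples" when it overlaps a normal bicluster at or above the threshold, and the function names are hypothetical. The RFN run itself is not shown.

```python
def jaccard(a, b):
    """Jaccard index of two gene sets (assumed overlap measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cancer_specific(cancer_biclusters, normal_biclusters, max_overlap=0.5):
    """Keep cancer biclusters that match no normal bicluster and that
    overlap every previously kept bicluster by less than max_overlap.

    Assumption: cancer_biclusters is ordered from best to worst, so the
    better bicluster of an overlapping pair is the one that is retained.
    """
    kept = []
    for genes in cancer_biclusters:
        # Drop biclusters that also appear (approximately) in normal samples.
        if any(jaccard(genes, g) >= max_overlap for g in normal_biclusters):
            continue
        # Drop biclusters that overlap an already accepted one too strongly.
        if any(jaccard(genes, k) >= max_overlap for k in kept):
            continue
        kept.append(genes)
    return kept
```

Processing the biclusters in a best-first order is a design choice of this sketch; any other tie-breaking rule for overlapping biclusters would change which of them survives.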
2.3. Prioritization of bicluster coding genes and miRNAs
We propose to prioritize breast cancer-related coding genes and miRNAs by integrating four aspects of information (as shown in Fig. 1). Only coding genes and miRNAs in breast cancer-specific biclusters are considered. For a coding gene or miRNA in a bicluster, the average differential correlation value dci is defined in Eq. (5).
where N is the total number of genes (coding genes and miRNAs) in a bicluster, and fij = 1 if the changes in the correlation relationship between two genes and between two experimental conditions are both significant; otherwise, fij = 0. Fisher's z-test is used to test differential correlation between the two conditions (normal and cancer). To test whether the two Pearson correlation coefficients in normal and cancer are significantly different, we transformed rN and rC into ZN and ZC, respectively. The Fisher transformation of rN is defined in Eq. (6).
Similarly, we transform rC to ZC. We used Eq. (7) to test the difference between the two correlations.
where nN and nC represent the sample sizes of normal and cancer samples, respectively. We used the local false discovery rate (fdr) in the fdrtool R package to test the significance [41,42].
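The differential correlation test can be illustrated with the following minimal Python sketch. It uses the textbook form of Fisher's z-test, which may differ in notation from Eqs. (6) and (7). Two points are assumptions of the sketch rather than statements from the text: significance is assessed with a two-sided normal p-value instead of the local fdr from fdrtool, and dci is computed as the mean of fij over the other N - 1 genes. All function names are hypothetical.

```python
from math import atanh, erfc, sqrt


def fisher_z(r):
    """Fisher transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return atanh(r)


def differential_correlation_test(r_n, r_c, n_n, n_c):
    """Fisher's z-test for the difference between two Pearson correlations.

    r_n, r_c: correlations in normal and cancer samples.
    n_n, n_c: sample sizes (each must exceed 3).
    Returns the z statistic and a two-sided normal p-value (the p-value is
    a stand-in for the local fdr used in the text).
    """
    z = (fisher_z(r_n) - fisher_z(r_c)) / sqrt(1 / (n_n - 3) + 1 / (n_c - 3))
    p = erfc(abs(z) / sqrt(2))
    return z, p


def average_differential_correlation(f, i):
    """Mean of f[i][j] over all j != i (assumed form of dci).

    f is an N x N 0/1 matrix of significant differential correlations.
    """
    n = len(f)
    return sum(f[i][j] for j in range(n) if j != i) / (n - 1)
```

For instance, `differential_correlation_test(0.8, 0.2, 53, 53)` compares a strong correlation in normal samples with a weak one in cancer samples, with 53 samples per condition.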