Reconstruction of Rna-seq bioinformatics analysis for Adaptive laboratory evolution of native methanol assimilation in Saccharomyces cerevisiae

Module 7BBG1002 group 1

Connecting to CREATE HPC

# Log in to HPC.
ssh -i~/.ssh/"key name" k********@hpc.create.kcl.ac.uk

Create directories to store raw data, outputs and reference genome.

cd /scratch_tmp/grp/msc_appbio
mkdir group1
cd group1
mkdir raw_reads
mkdir ref_gen
mkdir outputs

Acquiring sequences from the sequence read archive (SRA) with SRA Toolkit.

To download the sequences from the SRA, the SRA Toolkit was used. fasterq-dump is part of the SRA package and it was used to fetch the SRA files and convert them to FASTQ. The –gzip function is not supported in fasterq-dump and thus, we run it after downloading and converting had finished. –split-files was used to split the forward from the reverse reads as our data were paired-end.

# Go to the right directory.
cd /scratch_tmp/grp/msc_appbio/group1/raw_reads

# Searching for available modules corresponding to SRA Tools
module spider SRA Tools

# Loading module
module load sra-tools/3.0.3-gcc-13.2.0

fasterq-dump --split-files SRR11318268, SRR11318269, SRR11318270, SRR11318271 -O .

# After downloading and converting the SRA files to FASTQ, compress all the files in the directory.
gzip *

The results from downloading with –split-files gives 2 files per SRA, as mentioned before, one forward and one reverse. The suffix of the split files in the raw_reads/ directory is one with _1.fastq.gz and another with _2.fastq.gz.

FastQC

Creating conda environment to install and activate packages. Best practice for reproducibility and dependency management in computational workflows, especially if you work locally or the HPC does not provide a specific tool.

# Searching for conda in the HPC
module spider conda

# Loading conda
module load anaconda3/2021.05-gcc-13.2.0

# Creating conda environment 
create conda -—name myenv

# Activating conda environment
conda activate myenv

# Installing FastQC and MultiQC from Bioconda repository
conda install -c bioconda fastqc multiqc samtools

# Deactivating conda environment
conda deactivate

In the end, we did not use our conda environment to run FastQC and MultiQC as these modules are already installed in the HPC. We made a SBATCH script to perform FastQC and MultiQC to our raw fastq.gz files.

# Create and edit the script with nano.
nano fastqc.sh

#!/bin/bash

echo "start of pipeline"

module load fastqc
module load py-multiqc

# Specify input and output directories.
baseDirectory="/scratch_tmp/grp/msc_appbio/group1/raw_reads"
resultsDirectory="/scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output"

# Create results directory  if it does not exist. 
mkdir -p "$resultsDirectory"

# Run FastQC on all .fastq.gz files in baseDirectory
fastqc -o "$resultsDirectory" -t 4 "$baseDirectory"/*.fastq.gz 

echo "end of fastqc pipeline" 

echo "start of multiqc"

multiqc "$resultsDirectory"

echo "end of multiqc"

# Run the script interactively in the msc_appbio partition.
srun msc_appbio --pty /bin/bash
sbatch fastqc.sh
squeue -u k24090847

# Or run the script as a sbatch job. Select partition and specify cpu number if you do not want the default resources. 
sbatch -p msc_appbio --cpus-per-task = 12 fastqc.sh

# Check the status of your sbatch job.
squeue -u "knumber"

# Choose the rightdirectory to view the output files.
cd /scratch_tmp/grp/msc_appbio/group1/outputs

# List contents of the outputs directory.
ls

Move processed FASTQ files to local desktop from HPC using your local terminal.

# Open a terminal window and log into the remote server using sftp.
sftp -i~/.ssh/"key name" k********@hpc.create.kcl.ac.uk


# Change to the local directory where files will be downloaded
lcd /Users/User/Desktop


get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318268_1_fastqc.html 
get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318268_2_fastqc.html 

get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318269_1_fastqc.html 
get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318269_2_fastqc.html 

get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318270_1_fastqc.html  
get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318270_2_fastqc.html

get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318271_1_fastqc.html
get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/SRR11318271_2_fastqc.html

get /scratch_tmp/grp/msc_appbio/group1/outputs/fastqc_output/multiqc_report.html

Alignment with STAR

Download the reference genome to the HPC with NCBI Datasets Command Line (CLI) tools. We require the FNA and GTF files.

# Select the correct directory.
cd /scratch_tmp/grp/msc_appbio/group1/ref_gen/

# Create a conda environment 
conda create -n ncbi_datasets

#Activate environment
conda activate ncbi_datasets

#Install ncbi-datasets-cli in the environment
conda install -c conda-forge ncbi-datasets-cli

# Download reference genome and specify the GTF (annotation file) and genome(FNA) file types.
datasets download genome accession GCF_000146045.2 --include genome,gtf

Index reference genome using STAR RNA-seq aligner

nano star_indexing.sh

#!/bin/bash
module load star/2.7.10b-gcc-13.2.0

# Ensure line 131 is run on a single line for alignment to work
STAR --runThreadN 16  --runMode genomeGenerate --genomeDir /scratch_tmp/grp/msc_appbio/group1/ref_gen --genomeFastaFiles/scratch_tmp/grp/msc_appbio/group1/ref_gen/GCF_000146045.2_R64_genomic.fna --sjdbGTFfile /scratch_tmp/grp/msc_appbio/group1/ref_gen/genomic.gtf

Script of STAR alignment for paired-end RNA reads

#!/bin/bash

echo "start of pipeline"
module load star/2.7.10b-gcc-13.2.0

# Specify path to raw data
baseDir="/scratch_tmp/grp/msc_appbio/group1"

# Define path to STAR indexed reference genome
ref_genome_index="${baseDir}/ref_gen/starindex"

rawDataDir="${baseDir}/raw_reads/"
outputDataDir="${baseDir}/outputs/star_outs/"

# Create output directory if it doesn't exist
mkdir -p "$outputDataDir"

samples=$(ls "$rawDataDir"/*_1.fastq.gz | sed 's/_1.fastq.gz//' | xargs -n 1 basename)

# Loop where samples are aligned using STAR and the genome index, saving the outputs as sorted BAM file.
for sample in $samples; do #iterates through each item in the samples variable
        echo "Processing sample: $sample" #prints a message indicating which sample is being processed

STAR --runThreadN 6 --genomeDir "$ref_genome_index" --readFilesCommand zcat --readFilesIn "$rawDataDir${sample}_1.fastq.gz" "$rawDataDir${sample}_2.fastq.gz" --outFileNamePrefix "$outputDataDir${sample}_" --outSAMtype BAM SortedByCoordinate
#ensure line 160 is run on a single line for alignment to work

done
echo "STAR alignment complete!"

Use SAMtools to inspect first 10 lines of BAM file from STAR alignment

cd /scratch_tmp/grp/msc_appbio/group1/outputs
conda activate myenv
samtools view SRR11318268_Aligned.sortedByCoord.out.bam | head -n 10

Quantification of aligned reads using featureCounts

#!/bin/bash

# Load Modules
module load subread
module load py-multiqc
echo "modules loaded successfully."

# Define variables and paths
baseDir="/scratch_tmp/grp/msc_appbio/group1"
inputDir="${baseDir}/outputs/star_outs"
gtf_file="${baseDir}/ref_gen/raw_refgen/GCF_000146045.2"
output_Dir="${baseDir}/outputs/readCounts"
feature_type="db_xref"

# Create output directory if it doesn't exist
mkdir -p "$output_Dir"

# Run featureCounts tool

featureCounts -p -a ${gtf_file}/*.gtf -g $feature_type -o ${output_Dir}/readCounts2.txt ${inputDir}/*.out.bam

# Check if featureCounts ran successfully
if [[ $? -eq 0 ]]; then
  echo "featureCounts completed successfully. Results are in $output_Dir"
else
  echo "Error: featureCounts failed. Check your input files and parameters."
exit 1
fi

# Run multiQC
multiqc "$output_Dir"

# Check if MultiQC ran successfully
if [[ $? -eq 0 ]]; then
    echo "multiqc completed successfully"
else
    echo "Error: multiqc failed"
exit 1
fi

Transfer featureCounts output file out of HPC onto local desktop using sftp.

# Open a terminal window and log into the remote server using sftp.
sftp -i~/.ssh/"key name" k********@hpc.create.kcl.ac.uk


# Change to your local directory where files will be downloaded
lcd /Users/User/group_project7BBG1002

#Use get to retrieve the readCounts2.txt and readCounts2.txt.summary files

get /scratch_tmp/grp/msc_appbio/group1/outputs/readCounts/readCounts2.txt 
get /scratch_tmp/grp/msc_appbio/group1/outputs/readCounts/readCounts2.txt.summary

Moving to Rstudio

We will process, visualise and produce plots from our readCounts2.txt file using R version 4.4.1.

Installing Packages

Installing the packages BiocManager, dplyr, gplots, ggplot2 and ggrepel.

install.packages(c('BiocManager', 'dplyr', 'gplots', 'ggplot2', 'ggrepel', 'kableExtra'))

Installing the below packages with BiocManager.

Import Read Counts File

Loading libraries. Importing dataset under “countData” variable.

# Loading libraries.

# Enhance and style tables generated with knitr::kable.
library(kableExtra)

# For differential gene expression analysis of RNA-seq data. 
library(DESeq2) 

# A package for data manipulation and transformation. 
library(dplyr) 

library(ggrepel) # For labeling purposes

# A popular package for data visualization in R.
library(ggplot2) 

# Provides various plotting functions, with a focus on heatmaps and other complex visualizations.
library(gplots) 

 # Specifically designed for creating complex heatmaps with rich annotations.
library(ComplexHeatmap)

# Provides a collection of color palettes that are suitable for various types of data visualizations.
library(RColorBrewer) 

# Used for circular visualizations, especially for displaying relationships between variables in a circular layout
library(circlize) 

# Provides methods for performing Gene Ontology (GO) statistical tests and enrichment analysis.
library(GOstats) 

# A database package providing GO (Gene Ontology) annotations.
library(GO.db) 

# Used for statistical analysis and visualization of functional profiles (e.g., GO, KEGG) of genes, proteins, or metabolites.
library(clusterProfiler) 

# A framework for storing, querying, and retrieving annotation data, such as gene symbols, IDs, and GO terms.
library(AnnotationDbi) 

# An organism-specific annotation database, in this case, for *Saccharomyces cerevisiae* (yeast).
library(org.Sc.sgd.db) 

# Versions of the packages used.

package.version('kableExtra')

## [1] "1.4.0"

package.version('DESeq2')

## [1] "1.44.0"

package.version('dplyr')

## [1] "1.1.4"

package.version('ggrepel')

## [1] "0.9.6"

package.version('ggplot2')

## [1] "3.5.2"

package.version('gplots')

## [1] "3.2.0"

package.version('ComplexHeatmap')

## [1] "2.20.0"

package.version('RColorBrewer')

## [1] "1.1-3"

package.version('circlize')

## [1] "0.4.16"

package.version('GOstats')

## [1] "2.70.0"

package.version('GO.db')

## [1] "3.19.1"

package.version('clusterProfiler')

## [1] "4.12.6"

package.version('AnnotationDbi')

## [1] "1.66.0"

package.version('org.Sc.sgd.db')

## [1] "3.19.1"

# Load our readCounts2.txt file and store it to countData parameter. Skip first row as it is metadata that we do not require.
countData = read.table('readCounts2.txt', sep = '\t', header=FALSE, skip = 1) 

# Convert to dataframe.
countData = as.data.frame(countData)

# Set column names to the values of the 1st row.
colnames(countData) = countData[1, ] 

# Remove 1st row.
countData = countData[-1, ]

# Rename our sample IDs to ReEC_2, ReEC_1, Control_2, Control_1 respectively.
colnames(countData)[colnames(countData) == "/scratch_tmp/grp/msc_appbio/group1/outputs/star_outs/SRR11318268_Aligned.sortedByCoord.out.bam"] = "ReEC_2"
colnames(countData)[colnames(countData) == "/scratch_tmp/grp/msc_appbio/group1/outputs/star_outs/SRR11318269_Aligned.sortedByCoord.out.bam"] = "ReEC_1"
colnames(countData)[colnames(countData) == "/scratch_tmp/grp/msc_appbio/group1/outputs/star_outs/SRR11318270_Aligned.sortedByCoord.out.bam"] = "Control_2"
colnames(countData)[colnames(countData) == "/scratch_tmp/grp/msc_appbio/group1/outputs/star_outs/SRR11318271_Aligned.sortedByCoord.out.bam"] = "Control_1"

# Make column Geneid our row IDs.
rownames(countData) = countData$Geneid

#Remove GenBank: from our row IDs
rownames(countData) = gsub("GenBank:", "", rownames(countData)) 

#Remove the .number from our row IDs
rownames(countData) = gsub("\\..*", "", rownames(countData)) 

# Remove columns 1 to 6.
countData = countData[,-c(1:6)]

countData[] = lapply(countData, as.numeric)

# Print head of countData. 
head(countData)

##              ReEC_2 ReEC_1 Control_2 Control_1
## NM_001180043     34     56        46        31
## NM_001184582      0      0         0         0
## NM_001178208      0      0         4         0
## NM_001179897      4     18        10        16
## NM_001180042     51     80        55        63
## NM_001180041      0      0         0         0

# Load Paper_Results.csv for comparison later.
Paper_results = read.csv("./Paper_results.csv", sep = ';', stringsAsFactors = FALSE)

# Convert specific columns to numeric and replace decimal point separator from ',' to '.'
Paper_results$Differential.Expression.Adjusted.p.value = as.numeric(gsub(",", ".",Paper_results$Differential.Expression.Adjusted.p.value))

Paper_results$Differential.Expression.Log2.Ratio = as.numeric(gsub(",", ".",Paper_results$Differential.Expression.Log2.Ratio))

Paper_results$Differential.Expression.p.value = as.numeric(gsub(",", ".",Paper_results$Differential.Expression.p.value))

# Display structure of the dataframe to verify format.
str(Paper_results)

## 'data.frame':    5888 obs. of  6 variables:
##  $ Name                                    : chr  "HXT13 CDS" "HXT17 CDS" "DSF1 CDS" "hypothetical protein CDS" ...
##  $ Differential.Expression.Adjusted.p.value: num  1.46e-60 5.56e-42 3.15e-31 3.15e-31 3.15e-31 ...
##  $ Differential.Expression.Log2.Ratio      : num  2.95 2.69 2.15 2.11 -1.92 ...
##  $ Differential.Expression.p.value         : num  2.46e-64 1.87e-45 2.65e-34 2.40e-34 1.84e-34 ...
##  $ gene                                    : chr  "HXT13" "HXT17" "DSF1" "" ...
##  $ locus_tag                               : chr  "YEL069C" "YNR072W" "YEL070W" "YNR073C" ...

# Make column locus_tag as rownames and delete redundant locus_tag column afterwards.
rownames(Paper_results) = Paper_results$locus_tag
Paper_results$locus_tag = NULL

Quick Data Exploration

Summary of our dataset.

summary(countData)

##      ReEC_2            ReEC_1          Control_2         Control_1     
##  Min.   :      0   Min.   :      0   Min.   :      0   Min.   :     0  
##  1st Qu.:   2020   1st Qu.:   2154   1st Qu.:   1840   1st Qu.:  1971  
##  Median :   4362   Median :   4572   Median :   4084   Median :  4247  
##  Mean   :   9805   Mean   :  10047   Mean   :   9510   Mean   :  9790  
##  3rd Qu.:   8372   3rd Qu.:   8805   3rd Qu.:   8054   3rd Qu.:  8498  
##  Max.   :1274145   Max.   :1266216   Max.   :1345039   Max.   :873397

Read sums of read counts per sample.

colSums(countData)

##    ReEC_2    ReEC_1 Control_2 Control_1 
##  63505413  65071717  61593777  63411294

Barplot of our data (million counts per sample).

# Fix margins of our output.
par(mar=c(8,5,4,2)+0.1) 

# Create barplot.
barplot(colSums(countData)/1e6, main = "Total Read Counts", xlab = "Samples", ylab = "Counts per Million")

Histogram of the first column of our dataset showing that the data are highly skewed to the right.

# Fix margins of our output.
par(mar=c(7,5,4,4)+0.2) 

# Create histogram.
hist(countData$ReEC_2, xlim = c(0,120000), main = "Histogram of read counts for ReEC_2", xlab = "Number of read counts", br=1000)

Log Transformation. Taking the log2 for our sample read counts for better visualization.

logCountData = log2(1 + countData)

Repeating our previous histogram of the first column of our dataset, but now with the log transformed data.

# Fix margins of our output.
par(mar=c(7,5,4,4)+0.2)

# Create histogram with the log2 read counts
hist(logCountData$ReEC_2, xlim = c(0,20), main = "Histogram of read counts for ReEC_2", xlab = "Number of read counts", br=100)

Plotting Scatter Plot to observe the correlation between ReEC_2 and ReEC_1.

# Fix margins of our output.
par(mar=c(7,5,4,2)+0.1)

# Create scatter plot.
plot(logCountData[,1], logCountData[,2], xlab = "ReEC_2", ylab = "ReEC_1")

Plotting Scatter Plot to observe the correlation between Control_2 and Control_1.

# Fix margins of our output.
par(mar=c(7,5,4,2)+0.1)

# Create scatter plot.
plot(logCountData[,3], logCountData[,4], xlab = "Control_2", ylab = "Control_1")

Plotting Scatter Plot to observe the correlation between ReEC_2 and Control_2.

# Fix margins of our output.
par(mar=c(7,5,4,2)+0.1)

# Create scatter plot.
plot(logCountData[,1], logCountData[,3], xlab = "ReEC_2", ylab = "Control_2")

Normalization, and Trasformation using DESeq2

We have to make the experiment design into a small dataframe to associate each column (sample) in the countData object with a specific condition (mutated or control). Here we make a small table that has the sample/column names from our main dataset (countData) as row; and then a column for which columns of our main dataset are “treatment” or “control” respectively. This step is mandatory for DESeq2 package to properly analyse our data.

# Define our groups and indicate that object "group" is a factor.
groups = factor(c("mutated", "mutated", "control", "control")) 

# Create dataframe
sample_info = data.frame(row.names = colnames(countData), condition = groups)
sample_info

##           condition
## ReEC_2      mutated
## ReEC_1      mutated
## Control_2   control
## Control_1   control

Create a DESeqDataSet object (dds) using the DESeqDataSetFromMatrix function from the DESeq2 package. This dataset is then ready for downstream differential expression analysis.

# Create DESeq2 dataset
dds = DESeqDataSetFromMatrix(countData = countData, colData = sample_info, design = ~ condition)

Run the DESeq2 pipeline for differential expression analysis. Calculate significantly expressed genes using adjusted p-value cutoff of 0.01.

# Main function
dds = DESeq(dds)

# Confirm the number of rows is the equal with our countData dataset. 
nrow(dds)

## [1] 6477

# Results. The contrast argument specifies which groups to compare (in this case, mutated vs control). Default adjusted p-value is 0.1 but we set it to 0.01 based on the study we are trying to reproduce.
res = results(dds, contrast = c("condition", "mutated", "control"), alpha = 0.01) 

# Sorting the res table by the adjusted p-values (padj) in ascending order.
res = res[order(res$padj),] 

# Make a dataframe of res, so we can use it later to make a volcano plot.
res.df = as.data.frame(res)

# Remove N/A values.
res.df = na.omit(res.df) 

# Print head of res. 
head(res)

## log2 fold change (MLE): condition mutated vs control 
## Wald test p-value: condition mutated vs control 
## DataFrame with 6 rows and 6 columns
##               baseMean log2FoldChange     lfcSE      stat      pvalue
##              <numeric>      <numeric> <numeric> <numeric>   <numeric>
## NM_001178884   917.059        3.38035  0.212988   15.8711 1.00469e-56
## NM_001183250  3428.184        2.44669  0.185474   13.1916 9.81262e-40
## NM_001183249   397.737        3.09408  0.243089   12.7282 4.12390e-37
## NM_001178885  4439.225        2.45244  0.194268   12.6240 1.55681e-36
## NM_001179887 46228.818       -2.12102  0.177082  -11.9776 4.65538e-33
## NM_001179459  3885.273        2.12698  0.186749   11.3895 4.71450e-30
##                     padj
##                <numeric>
## NM_001178884 6.24315e-53
## NM_001183250 3.04878e-36
## NM_001183249 8.54197e-34
## NM_001178885 2.41850e-33
## NM_001179887 5.78571e-30
## NM_001179459 4.88265e-27

# Print summary of res.
summary(res)

## 
## out of 6337 with nonzero total read count
## adjusted p-value < 0.01
## LFC > 0 (up)       : 112, 1.8%
## LFC < 0 (down)     : 94, 1.5%
## outliers [1]       : 0, 0%
## low counts [2]     : 123, 1.9%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

We will now compare our results res.df to the study we are trying to reproduce. We will first create a subset of res.df with all the genes that present adjusted p-values(padj) < 0.01 and store it to sigGenes object. This subset will contain all the significant differentially expressed genes based on padj threshold of 0.01.

# Create subset of res.df.  
sigGenes = res.df[res.df$padj < 0.01,]

# Remove NA values. 
sigGenes = na.omit(sigGenes)

# Print dimentions of SigsPaper
dim(sigGenes)

## [1] 206   6

Filtering Paper_results for significant differentially expressed genes based on a adjusted p-value cutoff of 0.01, and storing it to SigsPaper object.

# Extract subset of significant differentially expressed genes from Paper_results.
SigsPaper = Paper_results[Paper_results$Differential.Expression.Adjusted.p.value < 0.01,]

SigsPaper = na.omit(SigsPaper)

# Print dimentions of SigsPaper
dim(SigsPaper)

## [1] 194   5

Now we will try to map our results IDs of differentially expressed genes (sigGenes) with paper’s ones (SigsPaper). To achieve this we will use org.Sc.sgd.db as a database to retrieve common IDs in both papers for comparison, since our dataset’s IDs differ from the paper’s ones. Therefore, we will make an extra column in sigGenes of COMMON IDs. We retrieve these IDs from org.Sc.sgd.db based on mapping to our row names (RefSeq ID type).

# Make COMMON column and map RefSeq IDs to ENTREZ from org.Sc.sgd.db 
sigGenes$COMMON = mapIds(org.Sc.sgd.db, keys = rownames(sigGenes), column = "COMMON", keytype = "REFSEQ", multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

# Remove NA values from sigGenes.
sigGenes = na.omit(sigGenes)

# Search for presence of common IDs in the "COMMON" column of sigGenes to the "gene" column of SigsPaper.
CommonIDs = intersect(sigGenes$COMMON, SigsPaper$gene)

# Make CommonIDs a dataframe.
CommonIDs = data.frame(CommonIDs)

# Remove NA values from CommonIDs.
CommonIDs = na.omit(CommonIDs)

# Print CommonIDs dimensions.
dim(CommonIDs)

## [1] 106   1

# Print CommonIDs as a scroll table.
kable(CommonIDs) %>% 
  kable_styling() %>%
  scroll_box(width = "500px", height = "500px")

CommonIDs
DSF1
COX5B
DAK2
RTS3
CWP1
CTT1
MAL12
IZH4
IZH2
SRD1
STB6
DSE2
CTS1
PUS2
ACO1
MLS1
MAL32
MEP3
PUT4
ESF1
TDA4
ICT1
CLN2
HXT10
SUC2
GAC1
ARG1
ISF1
YET2
MAL31
FUM1
FIT2
PAU7
IRC10
ICL1
COS12
HES1
AMN1
SRL1
CLN1
SPS100
FAS1
DIF1
ARO9
AXL2
SVS1
TOS6
GLK1
SPI1
ECM13
HSP30
EMI2
MTL1
ARG7
TOS1
PDH1
PRY1
ODC1
ENO2
SCW4
MSB2
DSE1
BAG7
PRM7
ARG3
MGA2
IDH1
KGD2
COR1
YRO2
IDH2
RKI1
SCW10
DBF20
MSA2
CIT2
DIT1
YAT2
ECL1
KCC4
CYB2
TAD2
SET4
ATP1
CYC1
HRP1
CHA1
SPS4
EGT2
SRL4
CIT1
IML2
MRK1
STL1
COX8
DSE3
IZH1
YPC1
TOS4
BUD7
ECM10
SUT1
ALG3
TCB3
CAX4
BSC1

By default, lfcThreshold is set to 0. The paper doesn’t specify a log fold change (LFC) cutoff and we assume that DESeq2 default values were used. However, typical cutoffs of 1.3 to 4 fold change (FC) are applied along with the FDR threshold in most studies for a more accurate prediction of the differentially expressed genes. If we apply a log2 fold change threshold of 0.379 (~1.3 FC) we can see that we get even less differentially expressed genes (significant) than the numbers mentioned in the paper.

# Results. The contrast argument specifies which groups to compare (in this case, mutated vs control). Default adjusted p-value is 0.1 but we set it to 0.01 based on the study we are trying to reproduce. Log2 fold change threshold (lfcThreshold) is set to 0.379.
res.strict = results(dds, contrast = c("condition", "mutated", "control"), lfcThreshold = 0.379, alpha = 0.01)

# Sorting the res table by the adjusted p-values (padj) in ascending order.
res.strict = res.strict[order(res.strict$padj),] 

# Make a dataframe of res.strict, so we can use it later to make a volcano plot.
res.strict.df = as.data.frame(res.strict)

# Remove N/A values
res.strict.df = na.omit(res.strict.df)

# Print a summary of res.strict
summary(res.strict)

## 
## out of 6337 with nonzero total read count
## adjusted p-value < 0.01
## LFC > 0.38 (up)    : 46, 0.73%
## LFC < -0.38 (down) : 21, 0.33%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Filtering our differentially expressed genes based on log2FoldChange cutoff of > 1.6 and adjusted p-value of < 0.01 to reduce the number of genes and extract the top differentially expressed genes for better visualization in a heatmap.

# Select genes from res.df with absolut log2FoldChange values > 1.6 and adjusted p-values < 0.01.
df.top = res.df[(abs(res.df$log2FoldChange) > 1.6) & (res.df$padj < 0.01),]

# Print df.top as dataframe with scroll bars.
kable(df.top) %>% 
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
NM_001178884	917.05889	3.380351	0.2129879	15.871097	0.00e+00	0.0000000
NM_001183250	3428.18447	2.446690	0.1854738	13.191565	0.00e+00	0.0000000
NM_001183249	397.73706	3.094082	0.2430893	12.728170	0.00e+00	0.0000000
NM_001178885	4439.22483	2.452436	0.1942675	12.624013	0.00e+00	0.0000000
NM_001179887	46228.81813	-2.121023	0.1770822	-11.977612	0.00e+00	0.0000000
NM_001179459	3885.27306	2.126981	0.1867486	11.389544	0.00e+00	0.0000000
NM_001179914	2699.94731	3.176047	0.3106209	10.224835	0.00e+00	0.0000000
NM_001184636	2947.87015	2.334704	0.2285380	10.215827	0.00e+00	0.0000000
NM_001181290	33333.83030	2.394363	0.2363183	10.131942	0.00e+00	0.0000000
SGD:S000007261	9212.24249	-2.305998	0.2309464	-9.984989	0.00e+00	0.0000000
GeneID:851854	1093.98745	2.006927	0.2126872	9.436049	0.00e+00	0.0000000
NM_001179662	4276.57274	-2.241660	0.2428992	-9.228767	0.00e+00	0.0000000
NM_001181217	29336.04432	-2.201083	0.2491844	-8.833149	0.00e+00	0.0000000
GeneID:856259	899.78828	1.737558	0.1970967	8.815763	0.00e+00	0.0000000
NM_001179913	10266.03905	1.741138	0.2051681	8.486399	0.00e+00	0.0000000
NM_001181421	20899.72028	2.858331	0.3415470	8.368779	0.00e+00	0.0000000
NM_001184567	1662.67650	-1.648336	0.1993342	-8.269211	0.00e+00	0.0000000
NM_001183390	1947.59422	1.859596	0.2320836	8.012616	0.00e+00	0.0000000
NM_001183355	7751.96132	1.849258	0.2318568	7.975865	0.00e+00	0.0000000
NM_001181418	74442.41415	2.278568	0.2882363	7.905211	0.00e+00	0.0000000
SGD:S000007262	7327.80280	-1.702045	0.2179335	-7.809928	0.00e+00	0.0000000
GeneID:851712	39909.06205	-2.177790	0.2812944	-7.742030	0.00e+00	0.0000000
NM_001178647	7998.68498	2.244825	0.3341334	6.718351	0.00e+00	0.0000000
NM_001183191	1449.40624	1.752945	0.2687778	6.521913	0.00e+00	0.0000000
NM_001181468	124956.73388	1.623340	0.2613041	6.212455	0.00e+00	0.0000001
NM_001178666	764.53425	1.617183	0.2769976	5.838258	0.00e+00	0.0000006
NM_001183127	7087.03138	-1.815860	0.3197033	-5.679827	0.00e+00	0.0000014
GeneID:850925	322.58216	1.676340	0.3056030	5.485354	0.00e+00	0.0000039
NM_001179269	403.14184	-1.671766	0.3379697	-4.946495	8.00e-07	0.0000559
GeneID:850341	62.83073	2.348477	0.5163569	4.548166	5.40e-06	0.0003329
GeneID:851849	69.35399	1.640244	0.4030889	4.069188	4.72e-05	0.0020940

Order df.top by the log2foldchange (Higher to lower values) for better visualization in the heatmap.

# Reorder the rows of df.top based on the log2FoldChange column in descending order.
df.top = df.top[order(df.top$log2FoldChange, decreasing = TRUE),]

Exploratory Data Analysis

Heatmap with ComplexHeatmap, RColorBrewer and circlize libraries

# Apply a regularized log transformation (rlog) to the count data in the dds object.The blind = FALSE argument indicates that the transformation will not ignore the experimental design.
rld = rlog(dds, blind=FALSE) 

# Making a matrix of the log-transformed expression values for the top differentially expressed genes across all samples.
mat = assay(rld)[rownames(df.top), rownames(sample_info)]

# Ensure the column names of mat correspond to the sample names from the sample_info dataframe.
colnames(mat) = rownames(sample_info)

# Calculate the mean expression for each gene (across all samples). This give us an idea of the overall expression level of each gene across the samples in the heatmap.
base_mean = rowMeans(mat)

# Center and scale each column (Z-score), then transpose.
mat.scaled = t(apply(mat, 1, scale)) 
colnames(mat.scaled) = colnames(mat)
head(mat.scaled)

##                 ReEC_2    ReEC_1  Control_2  Control_1
## NM_001178884 0.9135980 0.8165386 -0.8341066 -0.8960300
## NM_001179914 0.5256830 1.1453589 -0.7471479 -0.9238941
## NM_001183249 0.9028519 0.8283980 -0.8645939 -0.8666560
## NM_001181421 0.6262600 1.0394192 -1.0975682 -0.5681111
## NM_001178885 0.8838570 0.8426861 -0.7678736 -0.9586695
## NM_001183250 0.8776038 0.8515029 -0.7943780 -0.9347288

Making a matrix of the Log2FoldChange values of the filtered genes in df.top, to use it as an extra column in our heatmap.

# Get log2FoldChange value for each gene in df.top.
l2_val = as.matrix(df.top$log2FoldChange) 

# Assign column name to the matrix.
colnames(l2_val) = "log2FC"

l2_val

##          log2FC
##  [1,]  3.380351
##  [2,]  3.176047
##  [3,]  3.094083
##  [4,]  2.858331
##  [5,]  2.452436
##  [6,]  2.446690
##  [7,]  2.394363
##  [8,]  2.348477
##  [9,]  2.334704
## [10,]  2.278569
## [11,]  2.244825
## [12,]  2.126981
## [13,]  2.006927
## [14,]  1.859596
## [15,]  1.849258
## [16,]  1.752945
## [17,]  1.741138
## [18,]  1.737558
## [19,]  1.676340
## [20,]  1.640244
## [21,]  1.623340
## [22,]  1.617183
## [23,] -1.648336
## [24,] -1.671766
## [25,] -1.702045
## [26,] -1.815860
## [27,] -2.121022
## [28,] -2.177790
## [29,] -2.201083
## [30,] -2.241660
## [31,] -2.305998

Making a matrix of the baseMean values (Average gene expression across all samples) of the filtered genes in df.top to include as an extra column in our heatmap.

# Make matrix of the mean values for each gene in df.top.
mean = as.matrix(df.top$baseMean) 

# Assign column name to the matrix.
colnames(mean) = "AvrgExpr"

mean

##           AvrgExpr
##  [1,]    917.05889
##  [2,]   2699.94731
##  [3,]    397.73706
##  [4,]  20899.72028
##  [5,]   4439.22483
##  [6,]   3428.18447
##  [7,]  33333.83030
##  [8,]     62.83073
##  [9,]   2947.87015
## [10,]  74442.41415
## [11,]   7998.68498
## [12,]   3885.27306
## [13,]   1093.98745
## [14,]   1947.59422
## [15,]   7751.96132
## [16,]   1449.40624
## [17,]  10266.03905
## [18,]    899.78828
## [19,]    322.58216
## [20,]     69.35399
## [21,] 124956.73388
## [22,]    764.53425
## [23,]   1662.67650
## [24,]    403.14184
## [25,]   7327.80280
## [26,]   7087.03138
## [27,]  46228.81813
## [28,]  39909.06205
## [29,]  29336.04432
## [30,]   4276.57274
## [31,]   9212.24249

Creating a color map for our Log2FoldChange and Average gene expression values.

# Define color scale for Z-scores, mapping low values to blue, zero to white, and high values to red.
col_z = colorRamp2(c(min(mat.scaled), 0, max(mat.scaled)), c("blue", "white", "red"))

# Maps values between b/w/r for min and max log2FoldChange values.
col_log2FC = colorRamp2(c(min(l2_val),0, max(l2_val)), c("blue", "white", "red"))

# Maps between 0% quantile, and 75% quantile of mean values --- 0, 25, 50, 75, 100.
col_AvrgExpr = colorRamp2(c(quantile(mean)[1], quantile(mean)[4]), c("white", "red"))

Creating a heatmap for the top 31 genes, consisting of 3 smaller heatmaps/columns that present the Z-score expression, the log2FoldChange in descending order, and the baseMean (Average gene expression) matrices respectively.

# Create an annotation that will be used later to summarise the l2FC column of the heatmap. ha goes as input to h2 (heatmap 2). 
ha = HeatmapAnnotation(summary = anno_summary(gp = gpar(fill = 2), 
                                               height = unit(2, "cm")))
# Create heatmap 1 (Z-scores).
h1 = Heatmap(mat.scaled, cluster_rows = F, 
            column_labels = colnames(mat.scaled), column_names_gp = gpar(fontsize = 9), name="Z-score",
            cluster_columns = T, col = col_z, 
             show_heatmap_legend = FALSE)

# Create heatmap 2 (log2 Fold Change).
h2 = Heatmap(l2_val, row_labels = rownames(df.top), 
            cluster_rows = F, name="log2FC", top_annotation = ha, col = col_log2FC, show_heatmap_legend = FALSE, # Define color scheme for log2FC
            row_names_gp = gpar(fontsize = 8), column_names_gp = gpar(fontsize = 9), cell_fun = function(j, i, x, y, w, h, col) { # Add text to each grid.
              grid.text(round(l2_val[i, j],2), x, y, gp = gpar(fontsize = 7))
            }) # Round text to 2 decimals.

# Create heatmap 3 (Average Expression).
h3 = Heatmap(mean, row_labels = rownames(df.top), 
            cluster_rows = F, name = "AvrgExpr", col = col_AvrgExpr, show_heatmap_legend = FALSE,
            row_names_gp = gpar(fontsize = 7), column_names_gp = gpar(fontsize = 9), cell_fun = function(j, i, x, y, w, h, col) { # Add text to each grid.
              grid.text(round(mean[i, j],2), x, y, gp = gpar(fontsize = 7))
            }) # Round text to 2 decimals.

# Combine the heatmaps into one (h).
h = h1+h2+h3

# Extracting heatmap (h) into a PNG file. "png()" initializes the graphics device for saving the plot as a PNG file.
# png("./heatmap.png", res = 300, width = 2700, height = 5000)


# Create legends with spacing between title and color bar
lgd_list = list(
  Legend(title = "Z-score", col_fun = col_z, title_gap = unit(4, "mm")),
  Legend(title = "log2FC", col_fun = col_log2FC, title_gap = unit(4, "mm")),
  Legend(title = "AvrgExpr", col_fun = col_AvrgExpr, title_gap = unit(4, "mm"))
)

# Pack legends vertically with spacing between them
combined_legend = packLegend(
  lgd_list[[1]],
  lgd_list[[2]],
  lgd_list[[3]],
  direction = "vertical",
  gap = unit(10, "mm") # Increase gap between the legends
)

# Draw heatmap with packed legends on the right and apply padding
draw(h, 
     heatmap_legend_list = combined_legend, 
     heatmap_legend_side = "right", 
     padding = unit(c(10, 10, 10, 10), "mm"))

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()

Volcano Plot with ggplot2 and ggrepel

We will initially create a basic volcano plot step by step from res.df that takes the default lfcThreshold (0) and only defines the significant differentially expressed genes based on the adjusted p-value (padj < 0.01).

# Set our x-axis and y-axis values to log2FoldChage and -log10(padj) respectively. We convert padj values to the scale of -log10 for better visualisation. "geom_point()" adds scatter plot points to the plot.
vol_plot = res.df %>%
  ggplot(aes(x = log2FoldChange, y = -log10(padj))) + geom_point()

# Visualise ggplot output
vol_plot

Adding horizontal and vertical lines based on our padj and log2FoldChange cutoff values.

We have set padj cutoff at 0.01 and since our y-axis is in -log10 scale we also need to convert our threshold to -log10. Our log2FoldChange (L2FC) cutoff is set at 0. This defines which genes are upregulated (L2FC > 0) and which are downregulated (L2FC < 0). A L2FC of 0 equals to fold change of 1. We will addWe have also set our x-axis range from -6 to 6.

# Add threshold lines for both statistical significance (y-axis) and fold change (x-axis), and adjust the x-axis scale.
vol_plot + geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  geom_vline(xintercept = log2(1), linetype = "dashed") + scale_x_continuous(breaks = c(seq(-6, 6, 1)), # Modify x-axis tick intervals.
                                                                                            limits = c(-6, 6)) + # Modify x-axis range from -6 to 6.
  ggtitle("ReEC(mutated) vs Control") + # Add title.
  theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5)) + # Format title.
  labs(subtitle = "(fold change = 0, adjusted p-value = 0.01)") + # Add subtitle.
  theme(plot.subtitle = element_text(size = 8, face = "italic", hjust = 0.5)) # Format subtitle.

Add point colour, size and transparency

To visualise different groups of genes using different colours, point sizes, shapes or transparencies, we need to categorise genes into different groups and store these categories as a new parameter i.e. new column of data.

Genes with log2FoldChage > 0 & adj_p_val <= 0.01 as Upregulated.

Genes with log2FoldChage < 0 & adj_p_val <= 0.01 as Downregulated.

All other genes are labelled as Non-Significant.

# Create a new dataframe(res.df2) based on the original res.df and add a new column(gene_type). The new gene_type column categorizes each row (gene) into one of these three groups: Upregulated, Downregulated, Non-Significant; based on the log2FoldChange(0) and the padj(0.01) cutoffs. 
res.df2 = res.df %>%
  mutate(
    gene_type = case_when(
      log2FoldChange > 0 & padj <= 0.01 ~ "Upregulated",
      log2FoldChange < 0 & padj <= 0.01 ~ "Downregulated",
      TRUE ~ "Non-Significant")) 

# Extract the gene_type column from the new dataframe(res.df2) and store it in a variable.
gene_type.object = res.df2$gene_type  

# Summarize the counts of each category in the gene_type column.
res.df2 %>%
  count(gene_type)

##         gene_type    n
## 1   Downregulated   94
## 2 Non-Significant 6008
## 3     Upregulated  112

# Check gene_type category names.
res.df2 %>%
  distinct(gene_type) %>%
  pull()

## [1] "Upregulated"     "Downregulated"   "Non-Significant"

In ggplot2, you also have the option to visualize different groups by point color, size, shape and transparency by modifying parameters via scale_color_manual() etc. A tidy way of doing this is to store each visual specifications in a separate vector.

# Create vectors of custom colors and alphas (transparency) to volcano plot.
cols = c("Upregulated" = "#ffad73", "Downregulated" = "#26b3ff", "Non-Significant" = "grey") 
alphas = c("Upregulated" = 1, "Downregulated" = 1, "Non-Significant" = 0.5)

# Create a column in res.df2 that takes the rownames of our 20 most significant differentially expressed genes that will be used later for labeling purposes. This is possible as res.df2 is sorted in ascending order of the padj values.
res.df2 = res.df2 %>% mutate(top_20_genes = ifelse(row_number() <= 20, rownames(res.df2), NA))

# Create Volcano plot in the same way we did in the previous steps.
VolcanoPlot = ggplot( res.df2, aes(x = log2FoldChange,
             y = -log10(padj),
             fill = gene_type,
             size = gene_type,
             alpha = gene_type)) + 
  geom_point(shape = 21,     
             colour = "black", size = 2) + # Plot the points and modify color and size.
  geom_hline(yintercept = -log10(0.01),
             linetype = "dashed") + # Add dashed line in the y-axis at -log(0.01).
  geom_vline(xintercept = log2(1), 
             linetype = "dashed") + # Add dashed line in the x-axis at log2(1) fold change.
  scale_fill_manual(values = cols) + # Modify point color
  scale_alpha_manual(values = alphas) + # Modify point transparency
  scale_x_continuous(breaks = c(seq(-6, 6, 1)), # Modify x-axis tick intervals.
                     limits = c(-6, 6)) + # Modify x-axis range from -6 to 6.
  ggtitle("ReEC(mutated) vs Control") + # Add title.
  theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5)) + # Format title.
  labs(subtitle = "(fold change = 0, adjusted p-value = 0.01)") + # Add subtitle.
  theme(plot.subtitle = element_text(size = 8, face = "italic", hjust = 0.5)) + # Formt subtitle.
  geom_text_repel(aes(label = top_20_genes), # Add labels to the top 20 most significant differentially expressed genes.
                  size = 2, 
                  max.overlaps = Inf)

# Extracting Vol_plot2 into a PNG file and saving it to our current directory.
#png("./VolcanoPlot.png", res = 300, width = 4100, height = 2100)

# Renders the VolcanoPlot and writes it to the PNG file.
print(VolcanoPlot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off() # Turning off graphics device to free up memory.

Adding Log2 Fold Change cutoffs

Now we will create a volcano plot from res.strict.df that takes a lfcThreshold of 0.379(1.3 fold change), and padj of 0.01. Therefore, a gene is defined as differentially expressed (significant) only if |lfcThreshold| > 0.379 and padj <= 0.01.

# Create a new dataframe(res.strict.df2) based on the original res.strict.df and add a new column(gene_category). The new gene_category column categorizes each row (gene) into one of these three groups: Upregulated, Downregulated, Non-Significant; based on the |log2FoldChange| > 0.379 and the padj(0.01) cutoffs. 
res.strict.df2 = res.strict.df %>%
  mutate(
    gene_category = case_when(
      log2FoldChange >= 0.379 & padj <= 0.01 ~ "Upregulated",
      log2FoldChange <= -0.379 & padj <= 0.01 ~ "Downregulated",
      TRUE ~ "Non-Significant")) 

# Extract the gene_category column from the new dataframe(res.strict.df2) and store it in a variable.
gene_category.object = res.df2$gene_category  

# Summarize the counts of each category in the gene_category column.
res.strict.df2 %>%
  count(gene_category)

##     gene_category    n
## 1   Downregulated   21
## 2 Non-Significant 6270
## 3     Upregulated   46

# Check gene_category category names.
res.strict.df2 %>%
  distinct(gene_category) %>%
  pull()

## [1] "Upregulated"     "Downregulated"   "Non-Significant"

# Create vectors of custom colors and alphas (transparency) to volcano plot.
cols2 = c("Upregulated" = "#ffad73", "Downregulated" = "#26b3ff", "Non-Significant" = "grey") 
alphas2 = c("Upregulated" = 1, "Downregulated" = 1, "Non-Significant" = 0.5)

# Create a column in res.strict.df2 that takes the rownames of our 20 most significant differentially expressed genes that will be used later for labeling purposes. This is possible as res.strict.df2 is sorted in ascending order of the padj values.
res.strict.df2 = res.strict.df2 %>% mutate(top_20genes = ifelse(row_number() <= 20, rownames(res.df2), NA))

# Create Volcano plot in the same way we did in the previous steps.
VolcanoPlot_strict = ggplot( res.strict.df2, aes(x = log2FoldChange,
             y = -log10(padj),
             fill = gene_category,
             size = gene_category,
             alpha = gene_category)) + 
  geom_point(shape = 21,     
             colour = "black", size = 2) +  # Plot the points and modify color and size.
  geom_hline(yintercept = -log10(0.01),
             linetype = "dashed") + # Add dashed line in the y-axis at -log(0.01).
  geom_vline(xintercept = c(log2(1.3), -log2(1.3)), 
             linetype = "dashed") + # Add dashed line in the x-axis at log2(1.3) and -log2(1.3) fold change respectively.
  scale_fill_manual(values = cols2) + # Modify point color
  scale_alpha_manual(values = alphas2) + # Modify point transparency
  scale_x_continuous(breaks = c(seq(-6, 6, 1)), # Modify x-axis tick intervals.
                     limits = c(-6, 6)) + # Modify x-axis range from -6 to 6.
  ggtitle("ReEC(mutated) vs Control") + # Add title.
  theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5)) + # Format title.
  labs(subtitle = "(fold change = 1.3, adjusted p-value = 0.01)") + # Add subtitle.
  theme(plot.subtitle = element_text(size = 8, face = "italic", hjust = 0.5)) + # Formt subtitle.
  geom_text_repel(aes(label = top_20genes), # Add labels to the top 20 most significant differentially expressed genes.
                  size = 2, 
                  max.overlaps = Inf)

# Extracting VolcanoPlot_strict into a PNG file and saving it to our current directory.
#png("./VolcanoPlot_strict.png", res = 300, width = 4100, height = 2100)

# Renders the VolcanoPlot_strict and writes it to the PNG file.
print(VolcanoPlot_strict)

# "dev.off()" closes the graphics device, freeing up memory. 
#dev.off()

PCA plot with ggplot2 and ggrepel.

Making a function to detect our groups selection (mutated, control).

detectGroups = function(x) { return(groups) }

# Perform PCA to the regularized log transformed data. 
pca.object = prcomp(t(assay(rld)))  


  # Data preparation for plotting.

# Select the first two columns of pca.object which contains the principal component scores.
pcaData = as.data.frame(pca.object$x[,1:2]); 

# Add groups as a "Type" column in pcaData 
pcaData = cbind(pcaData, detectGroups(colnames(assay(rld)) ))

# Add column names
colnames(pcaData) = c("PC1", "PC2", "Type") 

# Calculate the percentage of variance explained by the first two principal components (PC1 and PC2). Use the summary(pca.object)$importance table to get the proportion of variance explained for each principal component and convert it to a percentage.
percentVar=round(100*summary(pca.object)$importance[2,1:2],0)


  # Plotting the PCA Results

# Initialize the ggplot object with the PCA data. Color and shape the points by the Type variable (which indicates the sample group).
p=ggplot(pcaData, aes(PC1, PC2, color=Type, shape = Type)) + geom_point(size=3)

# Label the x-axis and y-axis respectively with the percentages of variance explained by PC1 and PC2.
p=p+xlab(paste0("PC1: ",percentVar[1],"% variance")) 
p=p+ylab(paste0("PC2: ",percentVar[2],"% variance")) 

# Add title and ensure that the aspect ratio of the plot is equal, so the axes have the same scale.
p=p+ggtitle("Principal component analysis (PCA)") + coord_fixed(ratio=1.0) + 
    theme(plot.title = element_text(size = 12,hjust = 0.5)) + theme(aspect.ratio = 1) +
    theme(axis.text.x = element_text( size = 10),
    axis.text.y = element_text( size = 10),
    axis.title.x = element_text( size = 11),
    axis.title.y = element_text( size = 11) ) +
  theme(legend.text=element_text(size=11))

# Display plot
print(p)

GO Term Enrichment Analysis with clusterProfiler

Making four extra columns in sigGenes of ENTREZ, GENENAME, ENSEMBL and GO IDs. We retrieve these IDs from org.Sc.sgd.db based on mapping to our row names (RefSeq ID type).

# Make ENTREZ column and map RefSeq IDs to ENTREZ from org.Sc.sgd.db.
sigGenes$ENTREZ = mapIds(org.Sc.sgd.db, keys = rownames(sigGenes), column = "ENTREZID", keytype = "REFSEQ", multiVals = "first")

## 'select()' returned 1:1 mapping between keys and columns

# Make GENENAME column and map RefSeq IDs to GENENAME from org.Sc.sgd.db.
sigGenes$GENENAME = mapIds(org.Sc.sgd.db, keys = rownames(sigGenes), column = "GENENAME", keytype = "REFSEQ", multiVals = "first")

## 'select()' returned 1:1 mapping between keys and columns

# Make ENSEMBL column and map RefSeq IDs to ENSEMBL from org.Sc.sgd.db.
sigGenes$ENSEMBL = mapIds(org.Sc.sgd.db, keys = rownames(sigGenes), column = "ENSEMBL", keytype = "REFSEQ", multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

# Make GO column and map RefSeq IDs to GO from org.Sc.sgd.db.
sigGenes$GO = mapIds(org.Sc.sgd.db, keys = rownames(sigGenes), column = "GO", keytype = "REFSEQ", multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

Identifying upregulated and downregulated genes based on the log2FoldChange thresholds of (-0.379, 0.379), and extracting their corresponding ENTREZ IDs. Then, we store these genes to their respective objects.

# Filter upregulated genes from sigGenes if log2FoldChange >= 0.379 and select the ENTREZ column from sigGenes. Store this to Upregulated_genes object. 
Upregulated_genes = sigGenes[sigGenes$log2FoldChange >= 0.379, "ENTREZ"]

# Remove NA values.
Upregulated_genes = na.omit(Upregulated_genes)

# Filter downregulated genes from sigGenes if log2FoldChange <= -0.379 and select the ENTREZ column from sigGenes. Store this to Downregulated_genes object.
Downregulated_genes = sigGenes[sigGenes$log2FoldChange <= (-0.379), "ENTREZ"]

# Remove NA values.
Downregulated_genes = na.omit(Downregulated_genes)

Performing Gene Ontology (GO) enrichment analysis for Biological Process (BP) terms using the enrichGO function from the clusterProfiler package. The enrichment analysis is conducted for Upregulated and Downregulated genes based on their ENTREZID identifiers. We have used the Saccharomyces cerevisiae (baker’s yeast) database from the org.Sc.sgd.db to map our genes ENTREZ IDs to GO terms specific to yeast. Additionally, we have used a pvalueCutoff of 0.05 and also applied the Holm-Bonferroni correction to adjust p-values for multiple testing as stated in the paper.

# Perform enrichGO for Upregulated genes.
Upregulated_genesBP = enrichGO(gene = Upregulated_genes, OrgDb = "org.Sc.sgd.db", keyType = "ENTREZID", ont = "BP", pvalueCutoff = 0.05, pAdjustMethod = "holm")

# Perform enrichGO for Downregulated genes.
Downregulated_genesBP = enrichGO(gene = Downregulated_genes, OrgDb = "org.Sc.sgd.db", keyType = "ENTREZID", ont = "BP", pvalueCutoff = 0.05, pAdjustMethod = "holm")

# View Upregulated_genesBP as dataframe with scroll bars. 
kable(Upregulated_genesBP) %>% 
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	pvalue	p.adjust	qvalue	geneID	Count
GO:0034219	GO:0034219	carbohydrate transmembrane transport	9/86	39/6436	0.2307692	17.270125	11.859644	0.00e+00	0.0000008	0.0000006	856640/855809/850490/851352/853207/850536/852601/850317/851944	9
GO:0005975	GO:0005975	carbohydrate metabolic process	18/86	256/6436	0.0703125	5.261991	8.097947	0.00e+00	0.0000026	0.0000011	855810/856639/850491/853209/853984/853207/852602/856440/853418/854644/854350/852601/850317/854525/853395/852128/856579/855480	18
GO:0008643	GO:0008643	carbohydrate transport	9/86	48/6436	0.1875000	14.031977	10.545935	0.00e+00	0.0000054	0.0000015	856640/855809/850490/851352/853207/850536/852601/850317/851944	9
GO:0016052	GO:0016052	carbohydrate catabolic process	11/86	91/6436	0.1208791	9.046256	8.995708	0.00e+00	0.0000140	0.0000029	850491/853209/853984/853207/852602/854644/850317/854525/853395/852128/856579	11
GO:1904659	GO:1904659	glucose transmembrane transport	6/86	23/6436	0.2608696	19.522750	10.355603	4.00e-07	0.0002241	0.0000378	856640/855809/851352/850536/850317/851944	6
GO:0008645	GO:0008645	hexose transmembrane transport	6/86	26/6436	0.2307692	17.270125	9.673535	9.00e-07	0.0004942	0.0000596	856640/855809/851352/850536/850317/851944	6
GO:0015749	GO:0015749	monosaccharide transmembrane transport	6/86	26/6436	0.2307692	17.270125	9.673535	9.00e-07	0.0004942	0.0000596	856640/855809/851352/850536/850317/851944	6
GO:0015761	GO:0015761	mannose transmembrane transport	5/86	16/6436	0.3125000	23.386628	10.433206	1.50e-06	0.0008183	0.0000695	856640/855809/851352/850536/851944	5
GO:0006096	GO:0006096	glycolytic process	6/86	29/6436	0.2068966	15.483560	9.096695	1.80e-06	0.0009820	0.0000695	853984/850317/854525/853395/852128/856579	6
GO:0009137	GO:0009137	purine nucleoside diphosphate catabolic process	6/86	29/6436	0.2068966	15.483560	9.096695	1.80e-06	0.0009820	0.0000695	853984/850317/854525/853395/852128/856579	6
GO:0009181	GO:0009181	purine ribonucleoside diphosphate catabolic process	6/86	29/6436	0.2068966	15.483560	9.096695	1.80e-06	0.0009820	0.0000695	853984/850317/854525/853395/852128/856579	6
GO:0046032	GO:0046032	ADP catabolic process	6/86	29/6436	0.2068966	15.483560	9.096695	1.80e-06	0.0009820	0.0000695	853984/850317/854525/853395/852128/856579	6
GO:0009191	GO:0009191	ribonucleoside diphosphate catabolic process	6/86	30/6436	0.2000000	14.967442	8.923201	2.20e-06	0.0012057	0.0000784	853984/850317/854525/853395/852128/856579	6
GO:0009134	GO:0009134	nucleoside diphosphate catabolic process	6/86	31/6436	0.1935484	14.484621	8.757833	2.70e-06	0.0014765	0.0000784	853984/850317/854525/853395/852128/856579	6
GO:0019364	GO:0019364	pyridine nucleotide catabolic process	6/86	31/6436	0.1935484	14.484621	8.757833	2.70e-06	0.0014765	0.0000784	853984/850317/854525/853395/852128/856579	6
GO:0046031	GO:0046031	ADP metabolic process	6/86	31/6436	0.1935484	14.484621	8.757833	2.70e-06	0.0014765	0.0000784	853984/850317/854525/853395/852128/856579	6
GO:0015755	GO:0015755	fructose transmembrane transport	5/86	18/6436	0.2777778	20.788114	9.783121	2.80e-06	0.0015461	0.0000784	856640/855809/851352/850536/851944	5
GO:0009135	GO:0009135	purine nucleoside diphosphate metabolic process	6/86	32/6436	0.1875000	14.031977	8.599957	3.30e-06	0.0017848	0.0000811	853984/850317/854525/853395/852128/856579	6
GO:0009179	GO:0009179	purine ribonucleoside diphosphate metabolic process	6/86	32/6436	0.1875000	14.031977	8.599957	3.30e-06	0.0017848	0.0000811	853984/850317/854525/853395/852128/856579	6
GO:0072526	GO:0072526	pyridine-containing compound catabolic process	6/86	33/6436	0.1818182	13.606765	8.449004	4.00e-06	0.0021503	0.0000932	853984/850317/854525/853395/852128/856579	6
GO:0009185	GO:0009185	ribonucleoside diphosphate metabolic process	6/86	34/6436	0.1764706	13.206566	8.304467	4.80e-06	0.0025785	0.0001066	853984/850317/854525/853395/852128/856579	6
GO:0009154	GO:0009154	purine ribonucleotide catabolic process	6/86	35/6436	0.1714286	12.829236	8.165888	5.70e-06	0.0030732	0.0001215	853984/850317/854525/853395/852128/856579	6
GO:0009261	GO:0009261	ribonucleotide catabolic process	6/86	36/6436	0.1666667	12.472868	8.032855	6.70e-06	0.0036419	0.0001380	853984/850317/854525/853395/852128/856579	6
GO:0009132	GO:0009132	nucleoside diphosphate metabolic process	6/86	37/6436	0.1621622	12.135764	7.904992	8.00e-06	0.0042926	0.0001561	853984/850317/854525/853395/852128/856579	6
GO:0006195	GO:0006195	purine nucleotide catabolic process	6/86	38/6436	0.1578947	11.816401	7.781962	9.40e-06	0.0050339	0.0001761	853984/850317/854525/853395/852128/856579	6
GO:0006090	GO:0006090	pyruvate metabolic process	6/86	40/6436	0.1500000	11.225581	7.549191	1.27e-05	0.0068383	0.0002305	853984/850317/854525/853395/852128/856579	6
GO:0009166	GO:0009166	nucleotide catabolic process	6/86	42/6436	0.1428571	10.691030	7.332379	1.70e-05	0.0091310	0.0002969	853984/850317/854525/853395/852128/856579	6
GO:0046352	GO:0046352	disaccharide catabolic process	4/86	13/6436	0.3076923	23.026834	9.251058	1.94e-05	0.0103718	0.0003258	853209/853207/852602/854644	4
GO:0072523	GO:0072523	purine-containing compound catabolic process	6/86	45/6436	0.1333333	9.978295	7.033190	2.56e-05	0.0136792	0.0004046	853984/850317/854525/853395/852128/856579	6
GO:0000023	GO:0000023	maltose metabolic process	4/86	14/6436	0.2857143	21.382060	8.884103	2.69e-05	0.0143191	0.0004046	853209/853207/852602/852601	4
GO:0009313	GO:0009313	oligosaccharide catabolic process	4/86	14/6436	0.2857143	21.382060	8.884103	2.69e-05	0.0143191	0.0004046	853209/853207/852602/854644	4
GO:0006091	GO:0006091	generation of precursor metabolites and energy	12/86	218/6436	0.0550459	4.119479	5.452826	2.82e-05	0.0149490	0.0004046	854695/850491/853984/852825/854350/855105/850317/854525/853395/852128/856579/855480	12
GO:1901292	GO:1901292	nucleoside phosphate catabolic process	6/86	46/6436	0.1304348	9.761375	6.939647	2.91e-05	0.0154474	0.0004046	853984/850317/854525/853395/852128/856579	6
GO:0005984	GO:0005984	disaccharide metabolic process	5/86	28/6436	0.1785714	13.363787	7.629673	2.92e-05	0.0154672	0.0004046	853209/853207/852602/854644/852601	5
GO:0009141	GO:0009141	nucleoside triphosphate metabolic process	7/86	70/6436	0.1000000	7.483721	6.347119	3.61e-05	0.0190442	0.0004849	853984/850317/854525/853395/852128/856579/856074	7
GO:0015791	GO:0015791	polyol transmembrane transport	4/86	16/6436	0.2500000	18.709302	8.253356	4.79e-05	0.0252210	0.0006229	856640/855809/850490/851352	4
GO:0009311	GO:0009311	oligosaccharide metabolic process	5/86	31/6436	0.1612903	12.070518	7.189950	4.90e-05	0.0257639	0.0006229	853209/853207/852602/854644/852601	5
GO:0046034	GO:0046034	ATP metabolic process	6/86	51/6436	0.1176471	8.804378	6.511478	5.31e-05	0.0278911	0.0006578	853984/850317/854525/853395/852128/856579	6
GO:1902600	GO:1902600	proton transmembrane transport	8/86	105/6436	0.0761905	5.701883	5.652843	7.11e-05	0.0372564	0.0008578	856640/855809/854695/851352/853207/850536/852601/851944	8
GO:0098662	GO:0098662	inorganic cation transmembrane transport	10/86	172/6436	0.0581395	4.351000	5.183815	8.76e-05	0.0458013	0.0010301	856640/855809/854695/851352/853207/856260/850536/852601/851608/851944	10
GO:0009205	GO:0009205	purine ribonucleoside triphosphate metabolic process	6/86	56/6436	0.1071429	8.018272	6.138334	9.08e-05	0.0473990	0.0010411	853984/850317/854525/853395/852128/856579	6
GO:0046496	GO:0046496	nicotinamide nucleotide metabolic process	7/86	81/6436	0.0864198	6.467413	5.762398	9.29e-05	0.0484163	0.0010411	853984/850317/854525/853395/852128/856579/855480	7

# View Downregulated_genesBP as dataframe with scroll bars. 
kable(Downregulated_genesBP) %>% 
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	pvalue	p.adjust	qvalue	geneID	Count
GO:0045333	GO:0045333	cellular respiration	15/74	115/6436	0.1304348	11.344301	12.071201	0.00e+00	0.0000000	0.0000000	851013/855606/855866/856794/855969/855691/851726/856321/852235/854303/850361/853507/855732/854231/851111	15
GO:0009060	GO:0009060	aerobic respiration	14/74	99/6436	0.1414141	12.299208	12.218477	0.00e+00	0.0000000	0.0000000	851013/855606/855866/856794/855691/851726/856321/852235/854303/850361/853507/855732/854231/851111	14
GO:0006099	GO:0006099	tricarboxylic acid cycle	9/74	31/6436	0.2903226	25.250218	14.595878	0.00e+00	0.0000000	0.0000000	851013/855606/855866/856794/855691/851726/854303/850361/855732	9
GO:0015980	GO:0015980	energy derivation by oxidation of organic compounds	15/74	165/6436	0.0909091	7.906634	9.692444	0.00e+00	0.0000002	0.0000000	851013/855606/855866/856794/855969/855691/851726/856321/852235/854303/850361/853507/855732/854231/851111	15
GO:0045229	GO:0045229	external encapsulating structure organization	17/74	247/6436	0.0688259	5.985994	8.617540	0.00e+00	0.0000009	0.0000001	853766/856546/850992/856671/854421/855659/856541/852237/852459/853196/852897/856861/853282/855352/852012/852856/855355	17
GO:0071555	GO:0071555	cell wall organization	17/74	247/6436	0.0688259	5.985994	8.617540	0.00e+00	0.0000009	0.0000001	853766/856546/850992/856671/854421/855659/856541/852237/852459/853196/852897/856861/853282/855352/852012/852856/855355	17
GO:0006091	GO:0006091	generation of precursor metabolites and energy	16/74	218/6436	0.0733945	6.383337	8.720629	0.00e+00	0.0000012	0.0000001	851013/855606/855866/856794/855969/855691/851726/856321/852235/854303/854262/850361/853507/855732/854231/851111	16
GO:0071554	GO:0071554	cell wall organization or biogenesis	18/74	297/6436	0.0606061	5.271089	8.127582	0.00e+00	0.0000022	0.0000002	853766/856546/850992/856671/854421/855659/856541/852237/852459/853196/852897/856861/853282/855352/852012/854475/852856/855355	18
GO:0005975	GO:0005975	carbohydrate metabolic process	14/74	256/6436	0.0546875	4.756335	6.614291	9.00e-07	0.0005514	0.0000473	853972/850992/855606/851092/856671/856794/855659/853196/854262/855352/850361/855732/852856/855355	14
GO:0051301	GO:0051301	cell division	15/74	296/6436	0.0506757	4.407414	6.472617	1.00e-06	0.0005617	0.0000473	856546/850992/855819/855427/852455/855659/855239/854666/852897/856861/853003/850334/855389/854475/852119	15
GO:0000920	GO:0000920	septum digestion after cytokinesis	5/74	20/6436	0.2500000	21.743243	10.019655	2.40e-06	0.0013970	0.0001071	856546/850992/852455/856861/855389	5
GO:0019752	GO:0019752	carboxylic acid metabolic process	14/74	370/6436	0.0378378	3.290869	4.894894	6.49e-05	0.0381142	0.0026827	855606/855866/856794/856539/856108/855691/851726/854303/850361/856745/854950/850295/855732/855786	14

# Plotting our results of Upregulated_genesBP in a barplot using ggplot2 and ggrepel for customization of the title.
Up_genesBP.plot = barplot(Upregulated_genesBP, showCategory = 30) + ggtitle("GO term Enrichment Analysis of Upregulated Genes (BP)") + 
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))

# Extracting Up_genesBP.plot into a PNG file and saving it to our current directory.
#png("./Up_genesBP.png", res = 300, width = 3200, height = 5100)

# Renders the Up_genesBP.plot and writes it to the PNG file.
print(Up_genesBP.plot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()

# Plotting our results of Downregulated_genesBP in a barplot using ggplot2 and ggrepel for customization of the title.
Down_genesBP.plot = barplot(Downregulated_genesBP, showCategory = 20) + ggtitle("GO term Enrichment Analysis of Downregulated Genes (BP)") +
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))

# Extracting Down_genesBP.plot into a PNG file and saving it to our current directory.
#png("./Down_genesBP.png", res = 300, width = 3200, height = 5100)

# Renders the Down_genesBP.plot and writes it to the PNG file.
print(Down_genesBP.plot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()

Performing Gene Ontology (GO) enrichment analysis for Molecular Function (MF) terms using the enrichGO function from the clusterProfiler package. The enrichment analysis is conducted for Upregulated and Downregulated genes based on their ENTREZID identifiers. We have used the Saccharomyces cerevisiae (baker’s yeast) database from the org.Sc.sgd.db to map our genes ENTREZ IDs to GO terms specific to yeast. Additionally, we have used a pvalueCutoff of 0.05 and also applied the Holm-Bonferroni correction to adjust p-values for multiple testing as stated in the paper.

# Perform enrichGO for Upregulated genes.
Upregulated_genesMF = enrichGO(gene = Upregulated_genes, OrgDb = "org.Sc.sgd.db", keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05, pAdjustMethod = "holm")

# Perform enrichGO for Downregulated genes.
Downregulated_genesMF = enrichGO(gene = Downregulated_genes, OrgDb = "org.Sc.sgd.db", keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05, pAdjustMethod = "holm")

# View Upregulated_genesMF as dataframe with scroll bars.
kable(Upregulated_genesMF) %>%
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	pvalue	p.adjust	qvalue	geneID	Count
GO:0015144	GO:0015144	carbohydrate transmembrane transporter activity	8/86	30/6436	0.2666667	19.956589	12.110554	0.0000000	0.0000006	0.0000005	856640/855809/850490/851352/853207/850536/852601/851944	8
GO:0005351	GO:0005351	carbohydrate:proton symporter activity	7/86	25/6436	0.2800000	20.954419	11.632762	0.0000000	0.0000045	0.0000012	856640/855809/851352/853207/850536/852601/851944	7
GO:0005402	GO:0005402	carbohydrate:monoatomic cation symporter activity	7/86	25/6436	0.2800000	20.954419	11.632762	0.0000000	0.0000045	0.0000012	856640/855809/851352/853207/850536/852601/851944	7
GO:0015295	GO:0015295	solute:proton symporter activity	7/86	34/6436	0.2058824	15.407661	9.801933	0.0000002	0.0000451	0.0000094	856640/855809/851352/853207/850536/852601/851944	7
GO:0015294	GO:0015294	solute:monoatomic cation symporter activity	7/86	37/6436	0.1891892	14.158391	9.340803	0.0000004	0.0000831	0.0000140	856640/855809/851352/853207/850536/852601/851944	7
GO:0005353	GO:0005353	fructose transmembrane transporter activity	5/86	15/6436	0.3333333	24.945736	10.804611	0.0000010	0.0001916	0.0000231	856640/855809/851352/850536/851944	5
GO:0015578	GO:0015578	mannose transmembrane transporter activity	5/86	15/6436	0.3333333	24.945736	10.804611	0.0000010	0.0001916	0.0000231	856640/855809/851352/850536/851944	5
GO:0015293	GO:0015293	symporter activity	7/86	43/6436	0.1627907	12.182802	8.561880	0.0000013	0.0002400	0.0000256	856640/855809/851352/853207/850536/852601/851944	7
GO:0005355	GO:0005355	glucose transmembrane transporter activity	5/86	17/6436	0.2941176	22.010944	10.094225	0.0000021	0.0003803	0.0000363	856640/855809/851352/850536/851944	5
GO:0015145	GO:0015145	monosaccharide transmembrane transporter activity	5/86	20/6436	0.2500000	18.709302	9.230408	0.0000050	0.0009182	0.0000660	856640/855809/851352/850536/851944	5
GO:0015149	GO:0015149	hexose transmembrane transporter activity	5/86	20/6436	0.2500000	18.709302	9.230408	0.0000050	0.0009182	0.0000660	856640/855809/851352/850536/851944	5
GO:0051119	GO:0051119	sugar transmembrane transporter activity	5/86	20/6436	0.2500000	18.709302	9.230408	0.0000050	0.0009182	0.0000660	856640/855809/851352/850536/851944	5
GO:0022853	GO:0022853	active monoatomic ion transmembrane transporter activity	9/86	102/6436	0.0882353	6.603283	6.638046	0.0000074	0.0013308	0.0000898	856640/855809/854695/851352/853207/850536/852601/851608/851944	9
GO:0015291	GO:0015291	secondary active transmembrane transporter activity	8/86	90/6436	0.0888889	6.652196	6.283840	0.0000233	0.0041626	0.0002623	856640/855809/851352/853207/850536/852601/853039/851944	8
GO:0015318	GO:0015318	inorganic molecular entity transmembrane transporter activity	11/86	188/6436	0.0585106	4.378773	5.471470	0.0000357	0.0063505	0.0003755	856640/855809/854695/850490/851352/853207/856260/850536/852601/851608/851944	11
GO:0022890	GO:0022890	inorganic cation transmembrane transporter activity	10/86	161/6436	0.0621118	4.648274	5.455444	0.0000500	0.0088551	0.0004937	856640/855809/854695/851352/853207/856260/850536/852601/851608/851944	10
GO:0015078	GO:0015078	proton transmembrane transporter activity	8/86	103/6436	0.0776699	5.812599	5.729677	0.0000620	0.0109047	0.0005755	856640/855809/854695/851352/853207/850536/852601/851944	8
GO:0022804	GO:0022804	active transmembrane transporter activity	10/86	176/6436	0.0568182	4.252114	5.090631	0.0001062	0.0185800	0.0009313	856640/855809/854695/851352/853207/850536/852601/851608/853039/851944	10

# View Downregulated_genesMF as dataframe with scroll bars.
kable(Downregulated_genesMF) %>% 
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	pvalue	p.adjust	qvalue	geneID	Count
GO:0016798	GO:0016798	hydrolase activity, acting on glycosyl bonds	10/74	62/6436	0.1612903	14.027899	11.116229	0.0000000	0.0000002	0.0000002	856546/850992/856671/855659/852459/853196/855352/855389/852481/852856	10
GO:0004553	GO:0004553	hydrolase activity, hydrolyzing O-glycosyl compounds	9/74	49/6436	0.1836735	15.974628	11.347467	0.0000000	0.0000005	0.0000002	856546/850992/856671/855659/853196/855352/855389/852481/852856	9
GO:0015926	GO:0015926	glucosidase activity	5/74	30/6436	0.1666667	14.495496	7.990046	0.0000200	0.0029751	0.0008127	856546/855659/853196/855352/852856	5
GO:0022890	GO:0022890	inorganic cation transmembrane transporter activity	9/74	161/6436	0.0559006	4.861843	5.351728	0.0000842	0.0124663	0.0022584	852599/856321/852235/852177/856779/854231/852149/851111/851560	9
GO:0008324	GO:0008324	monoatomic cation transmembrane transporter activity	9/74	165/6436	0.0545455	4.743980	5.254125	0.0001018	0.0149688	0.0022584	852599/856321/852235/852177/856779/854231/852149/851111/851560	9
GO:0015075	GO:0015075	monoatomic ion transmembrane transporter activity	9/74	173/6436	0.0520231	4.524605	5.067989	0.0001464	0.0213733	0.0022584	852599/856321/852235/852177/856779/854231/852149/851111/851560	9
GO:0015078	GO:0015078	proton transmembrane transporter activity	7/74	103/6436	0.0679612	5.910785	5.418238	0.0001637	0.0237375	0.0022584	856321/852235/852177/854231/852149/851111/851560	7
GO:0046912	GO:0046912	acyltransferase activity, acyl groups converted into alkyl on transfer	3/74	10/6436	0.3000000	26.091892	8.563600	0.0001653	0.0237985	0.0022584	855606/850361/855732	3
GO:0009055	GO:0009055	electron transfer activity	5/74	46/6436	0.1086957	9.453584	6.205298	0.0001665	0.0238034	0.0022584	856321/852235/853507/854231/851111	5
GO:0015318	GO:0015318	inorganic molecular entity transmembrane transporter activity	9/74	188/6436	0.0478723	4.163600	4.747700	0.0002742	0.0389364	0.0031277	852599/856321/852235/852177/856779/854231/852149/851111/851560	9
GO:0015453	GO:0015453	oxidoreduction-driven active transmembrane transporter activity	4/74	30/6436	0.1333333	11.596396	6.273626	0.0003519	0.0496246	0.0031277	856321/852235/854231/851111	4
GO:0020037	GO:0020037	heme binding	4/74	30/6436	0.1333333	11.596396	6.273626	0.0003519	0.0496246	0.0031277	852979/854950/853507/854231	4
GO:0046906	GO:0046906	tetrapyrrole binding	4/74	30/6436	0.1333333	11.596396	6.273626	0.0003519	0.0496246	0.0031277	852979/854950/853507/854231	4

# Plotting our results of Upregulated_genesMF in a barplot using ggplot2 and ggrepel for customization of the title.
Up_genesMF.plot = barplot(Upregulated_genesMF, showCategory = 30) + ggtitle("GO term Enrichment Analysis of Upregulated Genes (MF)") + 
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))

# Extracting Up_genesMF.plot into a PNG file and saving it to our current directory.
#png("./Up_genesMF.png", res = 300, width = 3200, height = 5100)

# Renders the Up_genesMF.plot and writes it to the PNG file.
print(Up_genesMF.plot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()

# Plotting our results of Downregulated_genesMF in a barplot using ggplot2 and ggrepel for customization of the title.
Down_genesMF.plot = barplot(Downregulated_genesMF, showCategory = 20) + ggtitle("GO term Enrichment Analysis of Downregulated Genes (MF)") + 
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))

# Extracting Down_genesMF.plot into a PNG file and saving it to our current directory.
#png("./Down_genesMF.png", res = 300, width = 3200, height = 5100)

# Renders the Down_genesMF.plot and writes it to the PNG file.
print(Down_genesMF.plot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()

Performing Gene Ontology (GO) enrichment analysis for Cellular Component (CC) terms using the enrichGO function from the clusterProfiler package. The enrichment analysis is conducted for Upregulated and Downregulated genes based on their ENTREZID identifiers. We have used the Saccharomyces cerevisiae (baker’s yeast) database from the org.Sc.sgd.db to map our genes ENTREZ IDs to GO terms specific to yeast. Additionally, we have used a pvalueCutoff of 0.05 and also applied the Holm-Bonferroni correction to adjust p-values for multiple testing as stated in the paper.

# Perform enrichGO for Upregulated genes.
Upregulated_genesCC = enrichGO(gene = Upregulated_genes, OrgDb = "org.Sc.sgd.db", keyType = "ENTREZID", ont = "CC", pvalueCutoff = 0.05, pAdjustMethod = "holm")

# Perform enrichGO for Downregulated genes.
Downregulated_genesCC = enrichGO(gene = Downregulated_genes, OrgDb = "org.Sc.sgd.db", keyType = "ENTREZID", ont = "CC", pvalueCutoff = 0.05, pAdjustMethod = "holm")

# View Downregulated_genesCC as dataframe with scroll bars.
kable(Downregulated_genesCC) %>% 
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	pvalue	p.adjust	qvalue	geneID	Count
GO:0005576	GO:0005576	extracellular region	17/74	121/6435	0.1404959	12.217445	13.434702	0.0000000	0.0000000	0.0000000	853766/856546/850992/856671/856794/854421/855659/855416/852459/853366/853196/852897/853282/855352/855389/852856/855355	17
GO:0009277	GO:0009277	fungal-type cell wall	15/74	132/6435	0.1136364	9.881757	11.120028	0.0000000	0.0000000	0.0000000	853766/856546/850992/856671/854421/855659/855940/855416/852459/853196/853282/855352/855389/852856/855355	15
GO:0005618	GO:0005618	cell wall	15/74	133/6435	0.1127820	9.807458	11.069573	0.0000000	0.0000000	0.0000000	853766/856546/850992/856671/854421/855659/855940/855416/852459/853196/853282/855352/855389/852856/855355	15
GO:0030312	GO:0030312	external encapsulating structure	15/74	134/6435	0.1119403	9.734268	11.019651	0.0000000	0.0000000	0.0000000	853766/856546/850992/856671/854421/855659/855940/855416/852459/853196/853282/855352/855389/852856/855355	15
GO:0009986	GO:0009986	cell surface	6/74	18/6435	0.3333333	28.986487	12.823669	0.0000000	0.0000033	0.0000004	854421/855659/853196/852897/855352/852856	6
GO:1990204	GO:1990204	oxidoreductase complex	7/74	42/6435	0.1666667	14.493243	9.461987	0.0000004	0.0000402	0.0000040	856801/855691/851726/856321/852235/854303/854231	7
GO:0009295	GO:0009295	nucleoid	6/74	28/6435	0.2142857	18.634170	10.085568	0.0000006	0.0000592	0.0000044	851013/855691/851726/852177/850295/856682	6
GO:0042645	GO:0042645	mitochondrial nucleoid	6/74	28/6435	0.2142857	18.634170	10.085568	0.0000006	0.0000592	0.0000044	851013/855691/851726/852177/850295/856682	6
GO:0045239	GO:0045239	tricarboxylic acid cycle enzyme complex	4/74	10/6435	0.4000000	34.783784	11.530959	0.0000032	0.0003212	0.0000218	855866/855691/851726/854303	4
GO:0070469	GO:0070469	respirasome	6/74	41/6435	0.1463415	12.725775	8.123456	0.0000062	0.0006096	0.0000376	856321/852235/854950/853507/854231/851111	6
GO:0005621	GO:0005621	cellular bud scar	4/74	14/6435	0.2857143	24.845560	9.633052	0.0000148	0.0014490	0.0000821	856546/855659/855389/855355	4
GO:0005933	GO:0005933	cellular bud	12/74	257/6435	0.0466926	4.060364	5.400194	0.0000307	0.0029763	0.0001561	850992/856671/852455/854421/854666/856861/853282/850334/855389/854437/854903/852119	12
GO:0005750	GO:0005750	mitochondrial respiratory chain complex III	3/74	10/6435	0.3000000	26.087838	8.562890	0.0001653	0.0158729	0.0007210	856321/852235/854231	3
GO:0045275	GO:0045275	respiratory chain complex III	3/74	10/6435	0.3000000	26.087838	8.562890	0.0001653	0.0158729	0.0007210	856321/852235/854231	3
GO:0000307	GO:0000307	cyclin-dependent protein kinase holoenzyme complex	4/74	28/6435	0.1428571	12.422780	6.533068	0.0002677	0.0251654	0.0010216	855819/855427/855239/853003	4
GO:0070069	GO:0070069	cytochrome complex	4/74	28/6435	0.1428571	12.422780	6.533068	0.0002677	0.0251654	0.0010216	856321/852235/854231/851111	4
GO:0098800	GO:0098800	inner mitochondrial membrane protein complex	6/74	86/6435	0.0697674	6.066939	5.101947	0.0004280	0.0393777	0.0014676	856321/852235/852177/854231/851111/851560	6
GO:0005759	GO:0005759	mitochondrial matrix	10/74	244/6435	0.0409836	3.563912	4.403636	0.0004327	0.0393777	0.0014676	851013/855866/855691/851726/854303/850361/852177/850295/855732/856682	10
GO:0098552	GO:0098552	side of membrane	6/74	89/6435	0.0674157	5.862436	4.981871	0.0005149	0.0463367	0.0016544	853766/856546/856671/855416/855389/855355	6

# Plotting our results of Downregulated_genesCC in a barplot using ggplot2 and ggrepel for customization of the title.
Down_genesCC.plot = barplot(Downregulated_genesCC, showCategory = 20) + ggtitle("GO term Enrichment Analysis of Downregulated Genes (CC)") + 
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))

# Extracting Down_genesCC.plot into a PNG file and saving it to our current directory.
#png("./Down_genesCC.png", res = 300, width = 3200, height = 5100)

# Renders the Down_genesCC.plot and writes it to the PNG file.
print(Down_genesCC.plot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()

# View Upregulated_genesCC as dataframe.
as.data.frame(Upregulated_genesCC)

##  [1] ID             Description    GeneRatio      BgRatio        RichFactor    
##  [6] FoldEnrichment zScore         pvalue         p.adjust       qvalue        
## [11] geneID         Count         
## <0 rows> (or 0-length row.names)

Pathway Enrichment Analysis with clusterProfiler

Performing KEGG pathway enrichment analysis for upregulated and downregulated genes using the enrichKEGG function from the clusterProfiler package. The pathway enrichment analysis is conducted for Upregulated and Downregulated genes based on NCBI gene identifiers (ENTREZ IDs) that match to their respective IDs in KEGG pathway database. organism = “sce” specifies that the organism is Saccharomyces cerevisiae. KEGG uses a three-letter organism code; “sce” corresponds to yeast. Additionally, we have used a pvalueCutoff of 0.05 and also applied the Holm-Bonferroni correction to adjust p-values for multiple testing as stated in the paper.

# Perform enrichKEGG for Upregulated_genes and store the results to Path_Upregulated_genes.
Path_Upregulated_genes = enrichKEGG(gene = Upregulated_genes, organism = "sce", keyType = "ncbi-geneid", pvalueCutoff = 0.05, pAdjustMethod = "holm", minGSSize = 5)

## Reading KEGG annotation online: "https://rest.kegg.jp/link/sce/pathway"...

## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/sce"...

## Reading KEGG annotation online: "https://rest.kegg.jp/conv/ncbi-geneid/sce"...

# Perform enrichKEGG for Downregulated_genes and store the results to Path_Downregulated_genes.
Path_Downregulated_genes = enrichKEGG(gene = Downregulated_genes, organism = "sce", keyType = "ncbi-geneid", pvalueCutoff = 0.05, pAdjustMethod = "holm", minGSSize = 5)

# View Path_Upregulated_genes as dataframe with scroll bars.
kable(Path_Upregulated_genes) %>% 
  kable_styling() %>%
  scroll_box(width = "1000px", height = "500px")

	category	subcategory	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	pvalue	p.adjust	qvalue	geneID	Count
sce00051	Metabolism	Carbohydrate metabolism	sce00051	Fructose and mannose metabolism - Saccharomyces cerevisiae (budding yeast)	6/38	23/2545	0.2608696	17.471396	9.767760	0.0000006	0.0000241	0.0000162	855810/856639/850491/853984/850317/852128	6
sce00052	Metabolism	Carbohydrate metabolism	sce00052	Galactose metabolism - Saccharomyces cerevisiae (budding yeast)	5/38	24/2545	0.2083333	13.952851	7.847977	0.0000196	0.0007438	0.0002576	853209/852602/854644/850317/852128	5
sce00500	Metabolism	Carbohydrate metabolism	sce00500	Starch and sucrose metabolism - Saccharomyces cerevisiae (budding yeast)	5/38	41/2545	0.1219512	8.167523	5.695305	0.0002868	0.0106117	0.0022571	853209/852602/854644/850317/852128	5
sce04081	NA	NA	sce04081	Hormone signaling - Saccharomyces cerevisiae (budding yeast)	3/38	10/2545	0.3000000	20.092105	7.446257	0.0003431	0.0123510	0.0022571	854052/854160/852102	3

# View Path_Downregulated_genes as dataframe with scroll bars.
kable(Path_Downregulated_genes) %>% 
  kable_styling() %>%
  scroll_box(width = "1000px", height = "800px")

	category	subcategory	ID	Description	GeneRatio	BgRatio	RichFactor	FoldEnrichment	zScore	pvalue	p.adjust	qvalue	geneID	Count
sce01200	Metabolism	Global and overview maps	sce01200	Carbon metabolism - Saccharomyces cerevisiae (budding yeast)	15/40	113/2545	0.1327434	8.445797	10.229473	0.0000000	0.0000000	0.0000000	853972/852979/851013/855606/851092/855866/856794/856108/855691/851726/854303/854262/850361/850295/855732	15
sce00020	Metabolism	Carbohydrate metabolism	sce00020	Citrate cycle (TCA cycle) - Saccharomyces cerevisiae (budding yeast)	8/40	31/2545	0.2580645	16.419355	10.913124	0.0000000	0.0000004	0.0000001	853972/851013/855866/855691/851726/854303/850361/855732	8
sce00630	Metabolism	Carbohydrate metabolism	sce00630	Glyoxylate and dicarboxylate metabolism - Saccharomyces cerevisiae (budding yeast)	6/40	29/2545	0.2068966	13.163793	8.323335	0.0000037	0.0001304	0.0000301	852979/851013/855606/856794/850361/855732	6
sce01210	Metabolism	Global and overview maps	sce01210	2-Oxocarboxylic acid metabolism - Saccharomyces cerevisiae (budding yeast)	6/40	42/2545	0.1428571	9.089286	6.678656	0.0000354	0.0012034	0.0002142	851013/855691/851726/854303/850361/855732	6
sce01110	Metabolism	Global and overview maps	sce01110	Biosynthesis of secondary metabolites - Saccharomyces cerevisiae (budding yeast)	16/40	360/2545	0.0444444	2.827778	4.728611	0.0000455	0.0015000	0.0002201	853972/852979/851013/855606/851092/855866/856794/856539/855691/851726/854303/854262/850361/850295/855732/855786	16
sce00190	Metabolism	Energy metabolism	sce00190	Oxidative phosphorylation - Saccharomyces cerevisiae (budding yeast)	7/40	79/2545	0.0886076	5.637658	5.290545	0.0001734	0.0055503	0.0006999	856321/852235/852177/853507/854231/851111/851560	7
sce01230	Metabolism	Global and overview maps	sce01230	Biosynthesis of amino acids - Saccharomyces cerevisiae (budding yeast)	8/40	125/2545	0.0640000	4.072000	4.449930	0.0005516	0.0170996	0.0019078	851013/855691/854303/854262/850361/850295/855732/855786	8

# Plotting our results of Path_Upregulated_genes in a barplot using ggplot2 and ggrepel for customization of the title.
Up_genesPath_plot = barplot(Path_Upregulated_genes, showCategory = 30) + ggtitle("Pathway Enrichment Analysis of Upregulated Genes") + 
  theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))

# Extracting Up_genesPath_plot into a PNG file and saving it to our current directory.
#png("./Up_genesPath_plot.png", res = 300, width = 3000, height = 1000)

# Renders the Up_genesPath_plot and writes it to the PNG file.
print(Up_genesPath_plot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()

# Plotting our results of Path_Downregulated_genes in a barplot using ggplot2 and ggrepel for customization of the title.
Down_genesPath_plot = barplot(Path_Downregulated_genes, showCategory = 20) + ggtitle("Pathway Enrichment Analysis of Downregulated Genes") + theme(plot.title = element_text(size = 16, face = "bold", hjust = 0.5))

# Extracting Down_genesPath_plot into a PNG file and saving it to our current directory.
#png("./Down_genesPath_plot.png", res = 300, width = 3500, height = 2200)

# Renders the Down_genesPath_plot and writes it to the PNG file.
print(Down_genesPath_plot)

# "dev.off()" closes the graphics device, freeing up memory.
#dev.off()