On the ability of Silene vulgaris to deal with copper contaminants in soil
Aim of the project
- To improve de novo transcriptome assembly of S. vulgaris
- To compare transcriptomes of two S. vulgaris populations
- To find key genes responsible for copper tolerance
Input: RNA-seq datasets (24 samples)
- S. vulgaris population Stranska skala (non-metallicolous population from Brno, Czech Republic: N 49.190146, E 16.675319)
- S. vulgaris population Lubietova (metallicolous population from Slovakia: N 48.747893, E 19.385472)
The flowchart of de novo transcriptome assembly and differential gene transcription analyses
Input: 48 fastq files (paired-end RNA sequencing, Illumina HiSeq, 24 samples)
Filtering: downstream analyses of the raw Trinity output
1.) Input: 289 610 sequences (raw output of the Trinity de novo assembly)
- Purpose: to identify all misassembled and nonsense contigs
- Description: all reads (the concatenated fastq files) are re-mapped to the de novo assembled transcriptome (reference)
- Solution: all contigs with a TPM value of 0 and/or an IsoPct value of 0 are excluded
- Output: 186 921 sequences
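The exclusion rule of step 1 can be sketched as below. This is a minimal sketch, assuming the re-mapping was quantified with a tool that writes an RSEM-style isoform table with `transcript_id`, `TPM` and `IsoPct` columns; the page does not name the quantification tool. The function name and the sample rows are hypothetical.

```python
import csv
import io


def expressed_contigs(handle):
    """Return IDs of contigs whose TPM and IsoPct are both non-zero."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        # A contig is excluded when TPM == 0 and/or IsoPct == 0.
        if float(row["TPM"]) > 0 and float(row["IsoPct"]) > 0:
            kept.append(row["transcript_id"])
    return kept


# Hypothetical rows in an RSEM-style isoform-results layout (assumed format).
example = io.StringIO(
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "c1_g1_i1\tc1_g1\t900\t750.0\t120.0\t35.20\t28.10\t100.00\n"
    "c2_g1_i1\tc2_g1\t400\t250.0\t0.0\t0.00\t0.00\t0.00\n"
    "c3_g1_i1\tc3_g1\t650\t500.0\t40.0\t9.80\t7.70\t99.00\n"
    "c3_g1_i2\tc3_g1\t640\t490.0\t0.2\t0.01\t0.01\t0.00\n"
)
kept_ids = expressed_contigs(example)
print(kept_ids)
```

The kept IDs would then be used to subset the assembly FASTA; that step is omitted here.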
2.) Input: 186 921 sequences (output of the first step)
- Purpose: to reduce the number of duplicates, isoforms and splice variants
- Description: consensus sequences are generated from multiple sequence alignments of contigs
- Solution: only one consensus sequence is kept instead of multiple isoforms
- Note: Isoform - de Bruijn graph per expressed gene (Trinity assembly method)
- Output: 127 373 sequences
3.) Input: 127 373 sequences (output of the second step)
- Tool: blastn, e-value cut-off 1e-10
- Purpose: to remove possible contaminants (sequences from plastid genomes, plant pathogens, known transposable elements, S. vulgaris 18S rRNA, known plant repeats) and to select one representative sequence per group of very similar sequences (e-value less than 1e-50)
- Description: if a contig is not unique, the longest sequence is selected from its group of similar sequences based on bidirectional blastn. If the unique or selected sequence does NOT give its best blastn hit to a repeat/pathogen/TE/chloroplastome/mitochondriome/rRNA sequence, it is kept among the filtered de novo assembled transcripts and assumed to be a product of protein-coding nuclear gene transcription
- Solution:
  - blastn against databases of plastid genomes, plant pathogens and repeats
  - bidirectional blastn (the queried sequences are the same as the set of sequences used to build the reference database)
- Note: plant repeat element database: mips
- Note: database of ribosomal RNA sequences: SILVA db
- Note: database of repeats and transposable elements: repbase
- Note: database of plant pathogen sequences: plant pathogens DB
- Output: 53 107 sequences
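The selection of one representative per group of very similar sequences in step 3 can be sketched as below. This is a minimal sketch under stated assumptions: the hits are given as `(query, subject, evalue)` tuples parsed from an all-vs-all blastn run, two contigs are linked only when a hit below the cut-off is seen in both directions, and groups are formed by single linkage. The page does not describe the grouping procedure in this detail, and the function name and example data are hypothetical. The contaminant screen is not shown.

```python
from collections import defaultdict


def pick_representatives(lengths, hits, max_evalue=1e-50):
    """Return the longest contig ID from each group of similar contigs.

    lengths: dict mapping contig ID -> sequence length
    hits: iterable of (query, subject, evalue) from an all-vs-all blastn
    """
    parent = {cid: cid for cid in lengths}

    def find(cid):
        # Union-find lookup with path halving.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    strong = {(q, s) for q, s, e in hits if q != s and e < max_evalue}
    for q, s in strong:
        # Assumption: link a pair only if the hit is seen in both directions.
        if (s, q) in strong:
            parent[find(q)] = find(s)

    groups = defaultdict(list)
    for cid in lengths:
        groups[find(cid)].append(cid)
    # Longest contig per group; equal lengths are resolved by contig ID.
    return sorted(
        max(members, key=lambda c: (lengths[c], c)) for members in groups.values()
    )


# Hypothetical contigs and hits.
lengths = {"a": 1200, "b": 800, "c": 500, "d": 300}
hits = [
    ("a", "a", 0.0),
    ("a", "b", 1e-80),
    ("b", "a", 1e-75),
    ("c", "d", 1e-60),  # one direction only, so c and d stay separate
    ("d", "d", 0.0),
]
representatives = pick_representatives(lengths, hits)
print(representatives)
```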
BUSCO results for particular steps and attempts during the assembly
BUSCO results of our best transcriptome assembly and other publicly available assemblies
The plant genomes used were downloaded from Phytozome DB
Publicly available Silene vulgaris transcriptome: Taylor Lab
The best A. thaliana blastx hits (max = 3, e-value < 0.001) with the Silene vulgaris transcriptome:
contig1
Results of differential gene expression analysis
de novo assembly
Raw output from Trinity:
SV_rawOutput_Trinity.fa
Filtered assembly estimated as the best one:
SV_filtered_53107seqs_Trinity.fa
SAF file of the best filtered assembly, for use with the featureCounts tool:
SV_filtered_53107seqs_Trinity.saf
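For readers who want to build a comparable annotation for their own assembly, the sketch below writes SAF records from a FASTA file. SAF is the tab-delimited format read by featureCounts, with the columns GeneID, Chr, Start, End and Strand. The sketch assumes each contig is a single feature spanning its full length on the plus strand; whether the file above was built this way is not stated on the page. Function names and example sequences are hypothetical.

```python
import io


def fasta_lengths(handle):
    """Map each FASTA record ID (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def saf_lines(lengths):
    """Yield SAF rows; each contig is one full-length feature (assumption)."""
    yield "GeneID\tChr\tStart\tEnd\tStrand"
    for name, length in lengths.items():
        yield f"{name}\t{name}\t1\t{length}\t+"


# Hypothetical two-contig assembly.
fasta = io.StringIO(">contig1 len=11\nACGTACGT\nACG\n>contig2\nTTTT\n")
saf = list(saf_lines(fasta_lengths(fasta)))
print("\n".join(saf))
```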
Protein prediction with the TransDecoder tool
GFF3 file of all predicted proteins (longer than 30 AAs):
Silene_vulgaris.gff3
All predicted ORFs (longer than 30 AAs), fasta file:
Svulgaris_allORFs_nt.fa
All predicted proteins (longer than 30 AAs), fasta file:
Svulgaris_proteins_AA.fa
Filtered set of proteins (one protein per contig), fasta file:
Svulgaris_proteins_filteredSet.fa
Domain prediction
Simplified table of the domain(s) found per protein (input: the filtered set of proteins):
SV_filtered_domainPerSeq.csv
Differential gene transcription analyses:
Raw count tables per tissue (12 samples):
Svulgaris_roots_RawCounts_48103seqs.csv
Svulgaris_leaves_RawCounts_49617seqs.csv
Regularized log (rlog) normalized tables per tissue (12 samples):
lower cut-off: mean coverage of more than 15 mapped reads per contig and median coverage of more than 9 mapped reads per contig
upper cut-off: mean coverage of less than 100 000 mapped reads per contig and median coverage of less than 100 000 mapped reads per contig
Svulgaris_roots_rlogs_23469seqs.csv
Svulgaris_leaves_rlogs_22876seqs.csv
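The coverage cut-offs listed for the rlog tables can be sketched as a per-contig filter. This is a minimal sketch, assuming the cut-offs are applied to the raw read counts across the samples of one tissue before normalization; the page does not state at which stage they were applied. The function name and the four-sample example counts are hypothetical (the real tables have 12 samples per tissue).

```python
from statistics import mean, median


def passes_cutoffs(counts, min_mean=15, min_median=9, max_value=100_000):
    """Check one contig's per-sample read counts against both cut-offs."""
    m, med = mean(counts), median(counts)
    # Lower cut-off: mean > 15 and median > 9 mapped reads.
    # Upper cut-off: mean and median both below 100 000 mapped reads.
    return m > min_mean and med > min_median and m < max_value and med < max_value


# Hypothetical raw counts per contig.
raw_counts = {
    "contig1": [20, 18, 25, 30],
    "contig2": [0, 2, 60, 1],  # mean passes, median does not
    "contig3": [150000, 140000, 160000, 155000],  # above the upper cut-off
    "contig4": [12, 10, 11, 13],  # mean too low
}
kept = [cid for cid, counts in raw_counts.items() if passes_cutoffs(counts)]
print(kept)
```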
DESeq2 results
roots - SS versus LUB:
ref: SS population, query: LUB population
Svulgaris_roots_SSvsLUB_DESeq2_results.csv
roots - LUB control versus LUB copper:
ref: LUB control, query: LUB copper enriched soil
Svulgaris_roots_LUB_control_vs_copperTr_results.csv
roots - SS control versus SS copper:
ref: SS control, query: SS copper enriched soil
Svulgaris_roots_SS_control_vs_copperTr_results.csv
leaves - SS versus LUB:
ref: SS population, query: LUB population
Svulgaris_leaves_SSvsLUB_DESeq2_results.csv
Novel transposons reconstructed in Silene latifolia:
fasta file of the reconstructed TEs; for details see Kralova et al., 2014
Silene_latifolia_novelTEs.fa
Table with raw read counts mapped onto repeats and transposons
SV_TEs_counts_results.csv
Discussion and highlights