De novo transcriptome assembly of the Baikal stone sculpin Paracottus knerii (Dybowski, 1874) from the Angara River as the first transcriptomic resource for Lake Baikal Cottidae

Vasiliy Pomazkin1, Polina Drozdova1,2, Ekaterina Borvinskaya1, Elizaveta Indosova1, Maxim Timofeyev1, Anton Gurkov1,2

1Institute of Biology, Irkutsk State University, 3 Lenin St., Irkutsk, 664025, Russia

2Baikal Research Centre, 5v Rabochaya St., Irkutsk, 664011, Russia

Corresponding author: Anton Gurkov (a.n.gurkov@gmail.com)

Academic editor: R. Yakovlev | Received: 23 April 2025 | Accepted: 10 April 2025 | Published: 17 April 2025

http://zoobank.org/3C097B75-D850-4DF7-B002-3A8C13712955

Citation: Pomazkin V, Drozdova P, Borvinskaya E, Indosova E, Timofeyev M, Gurkov A (2025) De novo transcriptome assembly of the Baikal stone sculpin Paracottus knerii (Dybowski, 1874) from the Angara River as the first transcriptomic resource for Lake Baikal Cottidae. Acta Biologica Sibirica 11: 541–550. https://doi.org/10.5281/zenodo.15208846

Abstract

Lake Baikal is the largest freshwater reservoir inhabited by a number of endemic species flocks belonging to different taxa. However, genomic resources for these groups of endemics are still relatively scarce, limiting understanding of the molecular mechanisms behind their physiology. One of these species flocks are Baikal sculpins of the family Cottidae. Here, we present the first transcriptome assembly for this group using stone sculpin Paracottus knerii (Dybowsky, 1874) as a model. The transcriptome was obtained from the whole body of a P. knerii fry and contains a diverse array of immunity-related transcripts, paving the way for studies investigating the immune response of Lake Baikal sculpins.
Keywords: Benthic fish, Baikal, Cottidae, immune system, immune response, transcriptome, sculpin

Introduction

Ancient Lake Baikal, containing a fifth of the surface liquid freshwater on the planet, is known for extensive endemic biodiversity (Cristescu et al. 2010; Brown et al. 2021). The variety of animal species, which belong to a number of taxonomic groups, complicates the accumulation of the databases of protein- and rRNA-coding sequences for the endemics, which in turn hampers physiological studies, monitoring programs, and conservation efforts. Baikal sculpins (family Cottidae, Pisces) represent one of the adaptive radiations that have happened in the lake. These fishes diversified to the amazing variety of ~40 species (Goto et al. 2015; Bogdanov 2023) and inhabit all depths of the lake. Some species descend to the Angara river, the only outflow of the lake (Taliev 1955; Bogdanov 2023). However, the genetics of this group remains relatively less explored than morphology (Sandel et al. 2024).

For six species of Baikal Cottidae, mitochondrial genomes have been published (Sandel et al. 2017; Mugue et al. 2021), but while undoubtedly useful for species identification and population genomics, they do not offer much for researchers who wish to study the diversity and expression of protein-coding genes. A chromosome-level genome assembly for Paracottus knerii (Dybowsky 1874) was announced (Mugue et al. 2023), but has not been made publicly available yet.

To bridge this gap, we here aim to obtain exome-wide data for P. knerii as one of the most common species by sequencing and assembling a whole-body transcriptome. P. knerii, the so-called stone sculpin, inhabits the littoral zone of the lake, which is the part subjected to climate change and local coastal eutrophication (Brown et al. 2021). Furthermore, it has been used as a model species to study noise exposure as a component of human-induced rapid environmental change and for ecotoxicological research (Sapozhnikova et al. 2021; Sudakov et al. 2022). This species descends to the Angara and Yenisei rivers. Individuals from Angara are slightly different from those from Baikal in terms of morphology, but their morphometric results overlap (Bogdanov 2007).

Materials and methods

Animal sampling and maintenance

The studied species is neither endangered nor protected; the research was approved by the Animal Subjects Research Committee of the Institute of Biology at Irkutsk State University (Protocol #2024/13). A 15-mm-long P. knerii fry was sampled in the Angara River in Irkutsk (52.27769° N, 104.27568° E) in August 2024 and identified according to (Taliev, 1955). It was maintained for 40 days at 15 °C with nauplii of Artemia salina (Linnaeus, 1758) provided as feed, then euthanized (250 μl/l clove oil emulsion), dissected to remove the gut and preserved in RNAlater (Thermo Fisher Scientific) for transport.

RNA sequencing

Total RNA was isolated with a MagMax kit (Thermo Fisher Scientific). After quality control (RNA concentration >100 ng/μL, RIN 7.6), 100 ng of RNA was processed with a TruSeq Stranded mRNA library preparation kit (Illumina), and the library was sequenced with a NovaSeq 6000 device (Illumina) (2 x 101 bp). Demultiplexing of the sequencing reads was performed with bcl2fastq v2.20 (Illumina). Adapters were trimmed with Skewer v0.2.2 (Jiang et al. 2014). RNA extraction, sequencing, and data analysis up to this point were performed by the CeGaT company (Germany). Read quality was analyzed with FastQC v0.11.9 and allowed us to proceed directly with assembly.

Transcriptome assembly

All analyses except indicated otherwise were performed using a small computing cluster (64 Gb RAM, 6 physical cores, 12 virtual cores). Reproducible code for all analyses is available from GitHub (https://github.com/drozdovapb/Paracottus_knerii_transcriptome/).

The main transcriptome assembly used in the subsequent analyses was performed with rnaSPAdes v3.13.1 (Bushmanova et al. 2019) using the option --ss-fr. Additionally, the Oyster River Protocol v2.3.3 was used to compare alternative assemblers (MacManes, 2018). Assembly quality was controlled with BUSCO v5.4.5 (Manni et al. 2021), which uses hmmsearch v3.3 (Eddy 2011) and Metaeuk v6.a5d39d9 (Karin et al. 2020), using the Actinopterygii database (actinopterygii_odb10).

Transcriptome annotation

The correspondence of the sample to the species P. knerii was confirmed by searching the transcriptome assembly for the reference mitochondrial genome sequence of this species (NCBI GenBank ID: MW732164; Mugue et al. 2021) with exonerate v2.4.0 (Slater & Birney 2005). Exonerate is part of Ensembl (Harrison et al. 2024).

The assembly was filtered in multiple steps to only retain the most reliable transcripts belonging to Chordata. First, kentUtils were used to filter by length (>199 bp) and Ns (<9N). The resulting file passed the TRAPID threshold of at most 200,000 sequences and was taxonomically annotated with the TRAPID server (Bucchini et al. 2021), which uses kaiju v1.7.3 (Menzel et al. 2016) for classification. In parallel, protein prediction was performed with TransDecoder v5.7.0 (Haas et al. 2014) with the following non-default options: -m 50 (minimal length of 50 amino acids) and -single-best-only (only one best open reading frame predicted per transcript). The predicted proteome was mapped against the eggNOG v5.0 database (Huerta-Cepas et al. 2019) with the eggNOG-mapper web server v2.1.12 (Cantalapiedra et al. 2021) to obtain functional and additional taxonomic annotation. The transcriptome assembly was filtered so that it contained only sequences identified as Chordata according to either TRAPID or eggNOG-mapper. The Euler diagram describing the intersection in their results was generated with the BioVenn package v1.1.3 (Hulsen 2021).

Next, the NCBI FCS pipeline v0.5.4 (Astashyn et al. 2024) was used to find and filter out potential contamination from the assembly through the public Galaxy server usegalaxy.org (The Galaxy Community 2024; Afgan et al. 2016). The lists were prepared with R v4.1.2 (R Core Team 2021). Filtering was done with seqkit v2.1.0 (Shen et al. 2024). Finally, annotation of the transcriptome assembly according to the best hits in the eggNOG database was prepared with R and appended to each sequence title with seqkit.

Data availability

The final filtered assembly was submitted to NCBI TSA database under the accession number GLBE00000000 (https://www.ncbi.nlm.nih.gov/nuccore/GLBE00000000.1). Raw unfiltered assembly can be found via the GitHub repository (https://clck.ru/3M4zwe). The reads were submitted to NCBI BioProject.

Annotation to the KEGG pathway database

Involvement of transcripts in the immune response was assessed using the eggNOG-mapper annotation. The numbers of transcripts classified as belonging to the KEGG (Kanehisa et al. 2025) pathways related to the immune system (https://www.genome.jp/kegg/pathway.html#organismal) were calculated for the final filtered transcriptome assembly. We used the classification of the pathways according to KEGG release 85.0 in order to match the slightly outdated classification in the eggNOG v5.0 database. The number of transcripts annotated with each pathway of interest was counted with R and visualized with the ggplot2 v3.5.0 package (Wickham 2016).

Results and discussion

Read quality control and assembly

In total, 51M paired reads (10G bases) were obtained for P. knerii fry transcripts. Quality control with FastQC showed good quality (Q>28) and absence of remaining sequence adapters, thus the reads were directly used for transcriptome assembly. The rnaSPAdes assembly had the following BUSCO score: C:80.5%[S:64.2%,D:16.3%],F:5.0%,M:14.5%,n:3640. It is not close to the ideally desired 100% but comparable with other published fish transcriptome assemblies (e.g., Kokkonen et al., 2024) and is suitable for most downstream applications, such as primer design and as a database for mass spectrometric proteome analysis. As it is generally recommended to compare several alternative assemblers (Raghavan et al. 2022), we also performed this step with the Oyster River protocol, which compares four strategies and integrates them into one assembly. However, the best output of the Oyster River protocol was slightly worse (C:74.2%[S:52.2%,D:22.0%],F:6.7%,M:19.1%,n:364), so the rnaSPAdes assembly was used for the downstream analysis.

Taxonomic annotation

To validate species identification, we performed a search for mitochondrial contigs in the transcriptome assembly. The three best hits (GLBE01000028.1, GLBE01000454.1, and GLBE01023288.1) covered >98% of the reference mitochondrial genome of P. knerii with >99% identity, undoubtedly confirming that the sample belonged to this species.

For taxonomic annotation of the assembly, we used TRAPID and eggNOG, and the transcripts classified as belonging to Chordata by these two tools largely coincided (Fig. 1A). TRAPID indicated a substantially higher number of Chordata transcripts, so we focused on this tool to assess the taxonomic annotation of the rest of transcripts (Fig. 1B). We found representatives of multiple taxonomic groups such as bacteria, flatworms, ciliates, viruses etc. in the unfiltered transcriptome assembly, which was expected given the high diversity of symbionts known for this species (Rusinek et al. 2024) and the fact that we used most of the body for RNA extraction. Although the fry gut was dissected to decrease contamination from the food, the relatively high amount of arthropod transcripts (Fig. 1B) indicated that contamination was still the case. Therefore, we cannot exclude the possibility that the identified transcripts of other taxonomic groups were sequenced from intestinal symbionts and not from those located in tissues or on the surface of the skin.

Importantly, 46,439 out of 100,208 transcripts were classified by either of the tools as coming from Chordata, that is, most probably from our object. This set was slightly reduced by the FCS pipeline to 46,413 transcripts due to possible contamination, and the resulting assembly was submitted to NCBI.

The diversity of immunity-related transcripts

To evaluate the diversity of functional proteins in the filtered transcriptome assembly, we concentrated on the transcripts involved in the immune response. For this, we used annotation of all vertebrate transcripts according to eggNOG-mapper and calculated how many of them belong to KEGG pathways related to the immune system (Fig. 2). The results showed that the transcriptome of P. knerii includes from dozens to hundreds of transcripts related to each pathway. These numbers are comparable to the output of a similar analysis performed earlier for the Betta splendens transcriptome (Amparyup et al. 2020), and indicate that the obtained transcriptome assembly can be useful in search for immunity-related and probably other functional proteins of Lake Baikal sculpins.

Figure 1. Taxonomic annotation of the transcripts obtained. (A) Euler diagram showing intersection between the sets of transcripts annotated as Chordata by TRAPID and eggNOG-mapper. (B) The numbers of transcripts belonging to a manually selected set of taxonomic groups according to TRAPID.
Figure 1
Figure 2. Numbers of transcripts involved in immune system-related KEGG pathways according to the annotation by eggNOG-mapper. The analysis was applied to the final filtered transcriptome assembly, i.e. only to the sequences clearly belonging to Chordata.
Figure 2

Conclusion

Here we presented the transcriptome assembly of a young stone sculpin P. knerii as the first exome-wide resource for searching functional proteins for the endemic diversity of Cottidae of Lake Baikal. The assembly published in NCBI was filtered to contain only transcripts most probably belonging to the host, but available raw reads can also be used for the search of sequences of the fish symbionts. Despite dissection to remove the gut, the reads probably contain RNA from the content of the digestive system, and the exact location of the symbionts in the fish body is unclear. Importantly, the obtained assembly contains a wide diversity of transcripts related to the immune system and can be used in physiological studies investigating the immune response of Lake Baikal sculpins.

Acknowledgements

The study was supported by the Russian Science Foundation within project #24-74-00095 (https://rscf.ru/en/project/24-74-00095/).

References