EGA - European Genome-Phenome Archive
The EGA provides a service for the permanent archiving and distribution of personally identifiable genetic and phenotypic data resulting from biomedical research projects. Data at EGA was collected from individuals whose consent agreements authorise data release only for specific research use to bona fide researchers. Strict protocols govern how information is managed, stored and distributed by the EGA project. The European Genome-phenome Archive (EGA) is jointly managed by the European Bioinformatics Institute (EBI) and the Centre for Genomic Regulation (CRG). Studies and datasets can be browsed on the public website. Each study is assigned a stable accession that may be referenced in publications. The EGA implements a controlled access policy whereby access decisions reside with the Data Access Committee (DAC). The EGA accepts only de-identified data with a DAC-approved access plan. The accepted data types include raw data formats from array-based and new sequencing platforms, as well as phenotype files describing study samples.
Platforms for connecting tools
Virtual research environments (VREs) are online analysis platforms that offer researchers seamless access to relevant data repositories, analysis pipelines and visualizers. openVRE is an open framework for rapidly prototyping these computational platforms on top of a cloud infrastructure. The framework makes it easy to plug in analysis tools and workflows in a modular way, so that researchers are eventually offered a comprehensive tooling catalogue. VEIS analysis methods form the basis of this catalogue. openVRE also facilitates access to external data repositories. The European Genome-Phenome Archive (EGA) is one of the data infrastructures that openVRE-based platforms can access in a secure and efficient way. The researcher's private genomic and phenotypic datasets are only accessed, and decrypted on the fly, when consumed by one of the analysis tools integrated in the system.
WfExS-backend is a high-level workflow execution command-line program that consumes and creates RO-Crates, focusing on the interconnection of content-sensitive research infrastructures for handling sensitive human data analysis scenarios. WfExS-backend delegates workflow execution to existing workflow engines, and it is designed to facilitate more secure and reproducible workflow executions in order to promote analysis reproducibility and replicability. Secure executions are achieved using FUSE encrypted directories for non-disclosable inputs, intermediate workflow execution results and output files. RO-Crates act as the element of knowledge transfer between repeated workflow executions: WfExS-backend stores all the gathered details, output metadata and execution provenance in the output RO-Crate so that future executions can be reproduced. Final execution results can be encrypted with the GA4GH Crypt4GH standard using the public keys of the target researchers or destination, so the results can be safely moved outside the execution environment through unsecured networks and storage.
The RD-Connect GPAP is an IRDiRC-recognised resource developed by the CNAG-CRG for secure data sharing and analysis. The platform provides a user-friendly tool for clinicians and clinical researchers to analyse pseudonymised exome and genome sequencing data from patients with rare diseases, integrated with clinical information, for diagnosis and gene discovery. Data is also discoverable and shareable with other authorised users of the system and with the broader scientific community as part of the Beacon Network and Matchmaker Exchange, in a privacy-preserving, controlled-access environment (GDPR compliant). The RD-Connect GPAP currently hosts phenotypic and processed genomic data (annotated genetic variants) from more than 20,000 patients and relatives and has been used to diagnose hundreds of patients. The corresponding raw files and phenotypic information are deposited, upon submitter authorisation, at the EGA for long-term controlled access and data re-use.
Rbbt is a framework for software development in bioinformatics standing for “Ruby Bioinformatics Toolkit”. This integral solution for genomics features tools that form the basis of most bioinformatics tasks such as parsing and tidying data, gathering and setting up resources like software tools and databases, organizing the sequential production of results for reusable/reproducible work, producing sharable reports and packaging interoperable functionalities into pluggable modules. Many of the features of Rbbt are organized around four large subsystems: Workflows, TSV files, Resource management, HTML and REST. The Rbbt framework provides incentives to adhere to several reasonable standards that improve reusability and interoperability.
The Galaxy project is an open source workflow engine that aims to make computational biology accessible to research scientists without a programming or systems administration background. Galaxy was originally developed for genomics data analysis, but it has evolved into a domain-agnostic platform used by researchers in fields such as chemoinformatics and climate prediction. The platform ensures reproducibility by capturing inputs, parameters, step order and datasets (both intermediate and final workflow results), and makes it possible to share the results publicly or on an individual basis. In addition, Galaxy supports data uploads from several sources (the user’s computer, by URL or directly from many online resources). Within the VEIS project, all these tasks can be carried out in the platform, since it integrates a large number of tools that can be connected into workflows to process data from hospitals and other research facilities.
Barcelona Computational Biomedical Platform (BCBP)
Annotates variants with biological data such as protein structural information, functionally important residues, conservation of functional domains and evidence of cross-species conservation.
nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports short-read Illumina sequencing data from both shotgun (e.g. sequencing directly from clinical samples) and enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based).
BoostDM is a method to score all possible point mutations (single base substitutions) in cancer genes for their potential to be involved in tumorigenesis. The method is based on the analysis of observed mutations in sequenced tumors and their site-by-site annotation with relevant features. The compendium of cancer genes and the mutational features for each cancer gene across malignancies have been derived from the systematic analysis of tens of thousands of tumor samples (www.intogen.org). Other relevant features have been collected from public databases.
IntOGen is a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients. The framework identifies cancer genes and pinpoints their putative mechanism of action across tumor types.
Cancer Genome Interpreter (CGI) is designed to support the identification of tumor alterations that drive the disease and detect those that may be therapeutically actionable. CGI relies on existing knowledge collected from several resources and on computational methods that annotate the alterations in a tumor according to distinct levels of evidence.
The Chemical Checker (CC) is a resource that provides processed, harmonized and integrated bioactivity data on 800,000 small molecules. The CC divides data into five levels of increasing complexity, ranging from the chemical properties of compounds to their clinical outcomes. In between, it considers targets, off-targets, perturbed biological networks and several cell-based assays such as gene expression, growth inhibition and morphological profilings. In the CC, bioactivity data are expressed in a vector format, which naturally extends the notion of chemical similarity between compounds to similarities between bioactivity signatures of different kinds.
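To illustrate how a vector representation lets compounds be compared by bioactivity, here is a minimal sketch that ranks compounds by cosine similarity between signature vectors. The compound names, the three-dimensional toy vectors and the choice of cosine similarity are assumptions made for illustration only; they do not reflect the Chemical Checker's actual signatures, dimensionality or distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, signatures):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in signatures.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy signatures (not real Chemical Checker data).
signatures = {
    "compound_A": [1.0, 0.0, 1.0],
    "compound_B": [0.9, 0.1, 1.1],
    "compound_C": [-1.0, 0.5, -1.0],
}
others = {name: vec for name, vec in signatures.items() if name != "compound_A"}
ranking = rank_by_similarity(signatures["compound_A"], others)
```

In this toy example `ranking` lists `compound_B` before `compound_C`, because its signature points in nearly the same direction as that of `compound_A`.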
DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases. DisGeNET integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype–phenotype relationships. The current version of DisGeNET (v7.0) contains 1,134,942 gene-disease associations (GDAs), between 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369,554 variant-disease associations (VDAs), between 194,515 variants and 14,155 diseases, traits, and phenotypes.
An open-source and fast classifier for predicting the impact of mutations in protein–protein complexes
Nextflow pipeline for splicing quantitative trait loci (sQTL) mapping based on sQTLseekeR2. The pipeline performs the following analysis steps: 1) Index the genotype file, 2) Preprocess the transcript expression data, 3) Test for association between splicing ratios (a multivariate phenotype) and genetic variants in cis (nominal pass), 4) Obtain an empirical P-value for each phenotype (permutation pass, optional), 5) Control for multiple testing.
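Steps 4 and 5 of the pipeline above can be sketched in a few lines. This is a generic illustration, not the code used by sQTLseekeR2: the function names are hypothetical, the empirical P-value uses the common (r + 1) / (n + 1) estimator, and Benjamini-Hochberg is assumed as the multiple-testing correction because the description does not name the method actually applied.

```python
def empirical_p_value(observed, permuted):
    """Empirical P-value: (r + 1) / (n + 1), where r is the number of
    permuted statistics at least as extreme as the observed one."""
    exceed = sum(1 for stat in permuted if stat >= observed)
    return (exceed + 1) / (len(permuted) + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy values, for illustration only.
p_emp = empirical_p_value(5.0, [1.0, 2.0, 3.0, 6.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Adding one to both numerator and denominator keeps the empirical P-value strictly above zero, which matters when only a limited number of permutations is run.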
Command-line tool for the visualization of splicing events across multiple samples
FOLD-X is a program for calculating the folding energies of proteins and for calculating the effect of a point mutation on the stability of a protein.
An online tool for diagnosis and gene discovery in rare disease research. The platform features allow identifying disease-causing mutations in rare disease patients and linking them with detailed clinical information.
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
PyHIST is a Histological Image Segmentation Tool: a lightweight semi-automatic pipeline to extract tiles with foreground content from SVS histopathology whole image slides (with experimental support for other formats). It is intended to be an easy-to-use tool to preprocess histological image data for usage in machine learning tasks. The PyHIST pipeline involves three main steps: 1) produce a mask for the input WSI that differentiates the tissue from the background, 2) create a grid of tiles on top of the mask, evaluate each tile to see if it meets the minimum content threshold to be considered as foreground and 3) extract the selected tiles from the input WSI at the requested resolution.
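The tiling logic in steps 2 and 3 above can be illustrated with a small, dependency-free sketch. It is not PyHIST's implementation: the function names are hypothetical, the mask is assumed to be a binary grid (1 = tissue) at the same resolution as the image, and tiles are non-overlapping squares.

```python
def select_foreground_tiles(mask, tile_size, min_content):
    """Return the (top, left) corners of the tiles whose tissue fraction
    is at least min_content. mask is a list of rows of 0/1 values."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    selected = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tissue = sum(
                mask[r][c]
                for r in range(top, top + tile_size)
                for c in range(left, left + tile_size)
            )
            if tissue / (tile_size * tile_size) >= min_content:
                selected.append((top, left))
    return selected


def extract_tile(image, top, left, tile_size):
    """Crop one square tile out of an image stored as a list of rows."""
    return [row[left:left + tile_size] for row in image[top:top + tile_size]]


# Toy 4x4 mask: tissue in the top-left and bottom-right corners.
mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
tiles = select_foreground_tiles(mask, tile_size=2, min_content=0.5)
first_tile = extract_tile(mask, tiles[0][0], tiles[0][1], 2)
```

In a real slide the mask is usually computed on a downsampled image, so tile coordinates would need to be rescaled before cropping at the requested resolution; that step is left out here.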
Web portal for the annotation of pathological protein variants.
EPICO / BLUEPRINT Data Analysis Portal, data loading scripts and API
The widespread incorporation of next-generation sequencing into clinical oncology has yielded an unprecedented amount of molecular data from thousands of patients. A main current challenge is to find reliable ways to extrapolate results from one group of patients to another and to bring rationale to individual cases in the light of what is known from the cohorts. We present OncoGenomic Landscapes, a framework to analyze and display thousands of cancer genomic profiles in a 2D space. Our tool allows users to rapidly assess the heterogeneity of large cohorts, enabling the comparison to other groups of patients, and using driver genes as landmarks to aid in the interpretation of the landscapes. In our web-server, we also offer the possibility of mapping new samples and cohorts onto 22 predefined landscapes related to cancer cell line panels, organoids, patient-derived xenografts, and clinical tumor samples.
The massive molecular profiling of thousands of cancer patients has led to the identification of many tumor type specific driver genes. However, only a few (or none) of them are present in each individual tumor and, to enable precision oncology, we need to interpret the alterations found in a single patient. Cancer PanorOmics (http://panoromics.irbbarcelona.org) is a web-based resource to contextualize genomic variations detected in a personal cancer genome within the body of clinical and scientific evidence available for 26 tumor types, offering complementary cohort and patient-centric views. Additionally, it explores the cellular environment of mutations by mapping them on the human interactome and providing quasiatomic structural details, whenever available. This ‘PanorOmic’ molecular view of individual tumors should contribute to the identification of actionable alterations ultimately guiding the clinical decision-making process.
Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Interactome3D is a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner.
Considering pathological genetic variants within the context of the human interactome network can help to understand the intricate genotype-to-phenotype relationships behind human diseases. It makes it possible, for instance, to distinguish changes that totally suppress a gene product (i.e. node removal) from those that might affect only one of its functions, modulating the way in which the protein interacts with its partners (i.e. edge-specific or edgetic). dSysMap is a resource for the systematic mapping of disease-related missense mutations on the structurally annotated binary human interactome from Interactome3D.
R implementation of GADA (Genetic Alteration Detector Algorithm) - used to detect CNVs from aCGH and intensity array SNP data
Mosaic Alteration Detector
invClust is a method to detect inversion-related haplotypes in SNP data
Rbbt stands for “Ruby Bioinformatics Toolkit”. It is a framework for software development in bioinformatics. Modules listed here: a workflow wrapping the TF text-mining results from https://github.com/fnl plus other databases; high-throughput sequencing related functionalities; and an Rbbt wrapper for the Variant Effect Predictor, which automatically downloads and installs the software.
Workflow for variant calling and other functionalities
nf-core/chipseq is a bioinformatics analysis pipeline used for Chromatin ImmunopreciPitation sequencing (ChIP-seq) data.
nf-core/rnaseq is a bioinformatics analysis pipeline used for RNA sequencing data.
nf-core/smrnaseq is a bioinformatics best-practice analysis pipeline used for small RNA sequencing data.
nf-core/atacseq is a bioinformatics analysis pipeline used for ATAC-seq data.
nf-core/cageseq is a bioinformatics analysis pipeline used for CAGE-seq sequencing data.
nf-core/methylseq is a bioinformatics analysis pipeline used for Methylation (Bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.
This pipeline is based on the HiC-Pro workflow. It was designed to process Hi-C data from raw FastQ files (paired-end Illumina data) to normalized contact maps. The current version supports most protocols, including digestion protocols as well as protocols that do not require restriction enzymes, such as DNase Hi-C. In practice, this workflow has been successfully applied to many datasets, including dilution Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C and HiChIP data.
The MuG Virtual Research Environment is an analysis platform for 3D/4D genomics analyses. It integrates genomics tools for chromatin dynamics data.
Package for exploring the exposome and performing association analyses between exposures and health outcomes.
Comparative Toxicogenomics Database data extraction, visualization and enrichment of environmental and toxicological studies.
Estimate chronological and gestational DNA methylation (DNAm) age as well as biological age using different methylation clocks
recombClust is an R package that groups chromosomes by their recombination history. Recombination history is based on a mixture model that, given a pair of SNP-blocks, separates chromosomes into two populations: one with high linkage disequilibrium (LD) and low recombination (linkage), and another with low LD and high recombination. The method uses the classification of several SNP-block pairs in a region to group chromosomes into clusters with different recombination histories. The package takes phased genotype data as input.
R package for robust detection of mosaic loss of chromosome Y (LOY) events from SNP array intensity data and NGS data.
Integrated platform connecting databases, registries, biobanks and clinical bioinformatics for rare disease research.
Computational framework to analyze and model 3C-based experiments.
Collection of protein interactions for which high-resolution three-dimensional structures are known. The interface residues are presented for each interaction type individually, plus global domain interfaces at which one or more partners (domains or peptides) bind. The web server visualizes these interfaces along with atomic details of individual interactions using Jmol.
Determines the inversion status of samples for known inversions. It uses SNP data as input and requires the following information about each inversion: genotype frequencies in the different haplotypes, R2 between the region's SNPs and inversion status, and heterozygote genotypes in the reference. The package includes these data for two well-known inversions and for two additional regions.
Package to find genetic inversions in genotype (SNP array) data.
OpenEBench is a platform designed to support scientific communities, such as VEIS, that come together to address the challenges of their domain and decide to carry out a scientific benchmarking of their methods and tools. In this way, OpenEBench benefits not only these communities but also other scientists, by offering a reference point where the benchmarking data can be accessed so that they can make informed decisions. This includes both software developers, who will be able to develop more efficient methods by comparing their results with those of similar resources, and researchers, who usually find it difficult to choose the right tool for the challenge they want to solve.