EGA - European Genome-Phenome Archive

EGA

The EGA provides a service for the permanent archiving and distribution of personally identifiable genetic and phenotypic data resulting from biomedical research projects. Data at the EGA were collected from individuals whose consent agreements authorize data release only for specific research use by bona fide researchers. Strict protocols govern how the EGA project manages, stores and distributes information. The European Genome-Phenome Archive (EGA) is jointly managed by the European Bioinformatics Institute (EBI) and the Centre for Genomic Regulation (CRG). Studies and datasets can be browsed on the public website. Each study is assigned a stable identifier that can be cited in publications. The EGA implements a controlled-access policy under which access decisions rest with the Data Access Committee (DAC). The EGA only accepts anonymized data accompanied by an access plan approved by the DAC. Accepted data types include raw data formats from array-based and next-generation sequencing platforms, as well as phenotype files describing study samples.

Tool interconnection platforms

VRE

Virtual research environments (VREs) are online analysis platforms that give researchers integrated, transparent access to data repositories, analysis tools and visualizers. openVRE is an open framework for rapidly prototyping such computational platforms on top of a cloud infrastructure. The framework makes it easy to integrate analysis tools in a modular way, so as to offer researchers a consolidated catalog of tools. The VEIS analysis methods are a starting point for building this catalog. openVRE also facilitates access to external data repositories. The European Genome-Phenome Archive (EGA) is one of the data infrastructures that openVRE-based platforms can access securely and efficiently. The researcher's private genomic and phenotypic data are accessed and decrypted only when they are consumed by one of the analysis tools integrated in the system.

WfExS

WfExS-backend is a command-line program that consumes and creates RO-Crates for the high-level execution of workflows, in a human data analysis scenario focused on interconnecting research infrastructures that hold sensitive data. WfExS-backend delegates workflow execution to existing workflow engines and is designed to make executions more secure and reproducible, promoting the reproducibility and replicability of analyses. Secure executions are achieved by using FUSE-based encrypted directories to store non-disclosable input files, intermediate workflow results and output files. RO-Crates themselves serve as a vehicle for transferring knowledge between different executions of a workflow. WfExS-backend stores all gathered details, output metadata and the execution trace in the output RO-Crate so that future executions can be reproduced. The final results can be encrypted with crypt4gh, a GA4GH standard, using the public keys of the recipient researchers or of the target system, which allows the results to be moved out of the execution environments over insecure networks and storage.

RD-Connect Genome-Phenome Analysis Platform

RD-Connect GPAP is a tool developed by CNAG-CRG and recognized by IRDiRC for securely sharing and analyzing data. The platform provides user-friendly tools that let clinicians and clinical researchers analyze pseudonymized genome and exome sequencing data from rare disease patients, integrated with clinical and phenotypic information, for diagnosis and gene discovery. Data can also be discovered and shared with other authorized users of the system and with the wider scientific community (the system is integrated with the Beacon Network and Matchmaker Exchange) in a secure, controlled-access, privacy-preserving and GDPR-compliant environment. RD-Connect GPAP currently hosts phenotypic data and processed genomic data (annotated genetic variants) from more than 20,000 patients and relatives, and has been used to diagnose hundreds of patients. The corresponding raw files, together with the phenotypic information, are deposited in the EGA, with the submitter's authorization, for long-term controlled access and data reuse.

Rbbt

Rbbt, which stands for “Ruby Bioinformatics Toolkit”, is a framework for software development in bioinformatics. This comprehensive solution for genomics provides tools that underpin most bioinformatics tasks, such as parsing and tidying data, gathering and configuring resources such as software tools and databases, orchestrating the sequential production of results for reusable and reproducible work, producing shareable reports, and packaging interoperable functionality into pluggable modules. Many of Rbbt's features are organized around four major subsystems: workflows, TSV files, resource management, and HTML and REST. The Rbbt framework provides incentives to adhere to several reasonable standards that improve reusability and interoperability.

UseGalaxy

The Galaxy project is a workflow engine whose purpose is to make computational biology accessible to research scientists who have no experience in programming or systems administration. Galaxy was originally developed for genomic data analysis but has evolved into a domain-agnostic platform, used by researchers in cheminformatics, climate prediction and other fields. To ensure reproducibility, the platform captures the input parameters, the execution order of each workflow's steps and the datasets (both intermediate and final), and allows the results of tool and workflow runs to be shared publicly or with individual users. Galaxy also supports uploading data from different sources (the user's computer, a URL or various online resources). Within the VEIS project, all these tasks can be carried out on the platform, since it integrates a large number of tools that can be interconnected into different workflows to process data from hospitals and other research centers.

Barcelona Computational Biomedical Platform (BCBP)

APPRIS

Annotates variants with biological data such as protein structural information, functionally important residues, conservation of functional domains and evidence of cross-species conservation.

nf-core-viralrecon

nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports short-read Illumina sequencing data from both shotgun (e.g. sequencing directly from clinical samples) and enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based).

BoostDM

BoostDM is a method to score all possible point mutations (single base substitutions) in cancer genes for their potential to be involved in tumorigenesis. The method is based on the analysis of observed mutations in sequenced tumors and their site-by-site annotation with relevant features. The compendium of cancer genes and the mutational features for each cancer gene across malignancies have been derived from the systematic analysis of tens of thousands of tumor samples (www.intogen.org). Other relevant features have been collected from public databases.

IntOGen

IntOGen is a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients. The framework identifies cancer genes and pinpoints their putative mechanism of action across tumor types.

Cancer Genome Interpreter (CGI)

Cancer Genome Interpreter (CGI) is designed to support the identification of tumor alterations that drive the disease and detect those that may be therapeutically actionable. CGI relies on existing knowledge collected from several resources and on computational methods that annotate the alterations in a tumor according to distinct levels of evidence.

Chemical Checker

The Chemical Checker (CC) is a resource that provides processed, harmonized and integrated bioactivity data on 800,000 small molecules. The CC divides data into five levels of increasing complexity, ranging from the chemical properties of compounds to their clinical outcomes. In between, it considers targets, off-targets, perturbed biological networks and several cell-based assays such as gene expression, growth inhibition and morphological profilings. In the CC, bioactivity data are expressed in a vector format, which naturally extends the notion of chemical similarity between compounds to similarities between bioactivity signatures of different kinds.
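The idea of comparing compounds through their bioactivity vectors can be illustrated with a small sketch. The text does not say which similarity measure the CC uses, so cosine similarity is an assumption here, and the three toy signatures are invented purely for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two bioactivity signatures of equal length."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical signatures (not real Chemical Checker data).
sig_a = [1.0, 0.0, 2.0]
sig_b = [2.0, 0.0, 4.0]  # same direction as sig_a, so maximally similar
sig_c = [0.0, 3.0, 0.0]  # orthogonal to sig_a, so no shared signal

print(cosine_similarity(sig_a, sig_b))
print(cosine_similarity(sig_a, sig_c))
```

Because every data type is reduced to a vector, the same function can compare compounds by chemistry, targets or cell-based readouts without modification.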

DisGeNET

DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases. DisGeNET integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype–phenotype relationships. The current version of DisGeNET (v7.0) contains 1,134,942 gene-disease associations (GDAs), between 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369,554 variant-disease associations (VDAs), between 194,515 variants and 14,155 diseases, traits, and phenotypes.

UEP

An open-source and fast classifier for predicting the impact of mutations in protein–protein complexes

sqtlseeker2-nf

Nextflow pipeline for splicing quantitative trait loci (sQTL) mapping based on sQTLseekeR2. The pipeline performs the following analysis steps: 1) Index the genotype file, 2) Preprocess the transcript expression data, 3) Test for association between splicing ratios (a multivariate phenotype) and genetic variants in cis (nominal pass), 4) Obtain an empirical P-value for each phenotype (permutation pass, optional), 5) Control for multiple testing.
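Steps 4 and 5 can be sketched in a few lines. This is an illustrative sketch, not the pipeline's code: the add-one permutation estimator and the Benjamini-Hochberg procedure are assumptions, since the description above does not name the exact methods used, and the function names are hypothetical.

```python
from typing import List


def empirical_p_value(observed: float, permuted: List[float]) -> float:
    """Share of permuted statistics at least as large as the observed one.

    The +1 in numerator and denominator keeps the estimate above zero.
    """
    exceed = sum(1 for stat in permuted if stat >= observed)
    return (exceed + 1) / (len(permuted) + 1)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the original ranking.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy numbers, for illustration only.
print(empirical_p_value(5.0, [1.0, 2.0, 6.0, 3.0]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

In a real run the permuted statistics would come from re-testing each phenotype against shuffled genotypes, and the adjustment would be applied across all tested phenotypes.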

ggsashimi

Command-line tool for the visualization of splicing events across multiple samples

FoldX

FOLD-X is a program for calculating the folding energies of proteins and for calculating the effect of a point mutation on the stability of a protein.

RD-Connect Genome-Phenome Analysis Platform (GPAP)

An online tool for diagnosis and gene discovery in rare disease research. The platform features allow identifying disease-causing mutations in rare disease patients and linking them with detailed clinical information.

Nextflow

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

PyHIST

PyHIST is a Histological Image Segmentation Tool: a lightweight semi-automatic pipeline to extract tiles with foreground content from SVS histopathology whole image slides (with experimental support for other formats). It is intended to be an easy-to-use tool to preprocess histological image data for usage in machine learning tasks. The PyHIST pipeline involves three main steps: 1) produce a mask for the input WSI that differentiates the tissue from the background, 2) create a grid of tiles on top of the mask, evaluate each tile to see if it meets the minimum content threshold to be considered as foreground and 3) extract the selected tiles from the input WSI at the requested resolution.
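The grid-and-threshold logic of steps 2 and 3 can be sketched on a toy binary mask. This is not PyHIST's implementation or API: the function name, the list-of-lists mask representation and the parameter names are all hypothetical, and mask generation (step 1) is assumed to have been done already.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 1 = tissue (foreground), 0 = background


def select_tiles(mask: Mask, tile_size: int, min_content: float) -> List[Tuple[int, int]]:
    """Return (row, col) origins of tiles whose tissue fraction meets the threshold."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    selected = []
    # Lay a regular grid of tile_size x tile_size tiles over the mask;
    # partial tiles at the right and bottom edges are ignored.
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tissue = sum(
                mask[r][c]
                for r in range(top, top + tile_size)
                for c in range(left, left + tile_size)
            )
            # Keep the tile only if enough of its area is foreground.
            if tissue / (tile_size * tile_size) >= min_content:
                selected.append((top, left))
    return selected


toy_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(select_tiles(toy_mask, tile_size=2, min_content=0.5))  # prints [(0, 2)]
```

The returned origins would then be used to crop the corresponding regions from the full-resolution slide, scaling the coordinates if the mask was computed at a lower resolution.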

PMut

Web portal for the annotation of pathological protein variants.

epico-data-analysis-portal

EPICO / BLUEPRINT Data Analysis Portal, data loading scripts and API

OncogenomicLandscapes

The widespread incorporation of next-generation sequencing into clinical oncology has yielded an unprecedented amount of molecular data from thousands of patients. A main current challenge is to find reliable ways to extrapolate results from one group of patients to another and to bring rationale to individual cases in the light of what is known from the cohorts. We present OncoGenomic Landscapes, a framework to analyze and display thousands of cancer genomic profiles in a 2D space. Our tool allows users to rapidly assess the heterogeneity of large cohorts, enabling comparison with other groups of patients, and uses driver genes as landmarks to aid in the interpretation of the landscapes. In our web server, we also offer the possibility of mapping new samples and cohorts onto 22 predefined landscapes related to cancer cell line panels, organoids, patient-derived xenografts, and clinical tumor samples.

PanorOmics

The massive molecular profiling of thousands of cancer patients has led to the identification of many tumor type specific driver genes. However, only a few (or none) of them are present in each individual tumor and, to enable precision oncology, we need to interpret the alterations found in a single patient. Cancer PanorOmics (http://panoromics.irbbarcelona.org) is a web-based resource to contextualize genomic variations detected in a personal cancer genome within the body of clinical and scientific evidence available for 26 tumor types, offering complementary cohort and patient-centric views. Additionally, it explores the cellular environment of mutations by mapping them on the human interactome and providing quasiatomic structural details, whenever available. This ‘PanorOmic’ molecular view of individual tumors should contribute to the identification of actionable alterations ultimately guiding the clinical decision-making process.

Interactome3D

Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Interactome3D is a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner.

dSysMap

Considering pathological genetic variants within the context of the human interactome network can help to understand the intricate genotype-to-phenotype relationships behind human diseases. For instance, it makes it possible to distinguish changes that totally suppress a gene product (i.e. node removal) from those that might affect only one of its functions, modulating the way in which the protein interacts with its partners (i.e. edge-specific or edgetic). dSysMap is a resource for the systematic mapping of disease-related missense mutations on the structurally annotated binary human interactome from Interactome3D.

R-GADA

R implementation of GADA (Genetic Alteration Detector Algorithm) - used to detect CNVs from aCGH and intensity array SNP data

MAD

Mosaic Alteration Detector

invClust

invClust is a method to detect inversion-related haplotypes in SNP data

Rbbt

Rbbt stands for “Ruby Bioinformatics Toolkit”. It is a framework for software development in bioinformatics. It covers three aspects: a workflow wrapping the TF text-mining results from https://github.com/fnl plus other databases; high-throughput sequencing related functionalities; and an Rbbt wrapper for the Variant Effect Predictor, which auto-downloads and installs the software.

Rbbt-HTS

Workflow for variant calling and other functionalities

nf-core-chipseq

nf-core/chipseq is a bioinformatics analysis pipeline used for Chromatin ImmunopreciPitation sequencing (ChIP-seq) data.

nf-core-rnaseq

nf-core/rnaseq is a bioinformatics analysis pipeline used for RNA sequencing data.

nf-core-smrnaseq

nf-core/smrnaseq is a bioinformatics best-practice analysis pipeline used for small RNA sequencing data.

nf-core-atacseq

nf-core/atacseq is a bioinformatics analysis pipeline used for ATAC-seq data.

nf-core-cageseq

nf-core/cageseq is a bioinformatics analysis pipeline used for CAGE-seq sequencing data.

nf-core-methylseq

nf-core/methylseq is a bioinformatics analysis pipeline used for Methylation (Bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.

nf-core-hic

This pipeline is based on the HiC-Pro workflow. It was designed to process Hi-C data from raw FastQ files (paired-end Illumina data) to normalized contact maps. The current version supports most protocols, including digestion protocols as well as protocols that do not require restriction enzymes, such as DNase Hi-C. In practice, this workflow has been successfully applied to many datasets, including dilution Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C and HiChIP data.

MuGVRE

The MuG Virtual Research Environment is an analysis platform for 3D/4D genomics analyses. It integrates genomics tools for chromatin dynamics data.

rexposome

Package for exploring the exposome and performing association analyses between exposures and health outcomes.

CTDquerier

Comparative Toxicogenomics Database data extraction, visualization and enrichment of environmental and toxicological studies.

methylclock

Estimate chronological and gestational DNA methylation (DNAm) age as well as biological age using different methylation clocks

recombClust

recombClust is an R package that groups chromosomes by their recombination history. Recombination history is based on a mixture model that, given a pair of SNP-blocks, separates chromosomes into two populations, one with high Linkage Disequilibrium (LD) and low recombination (linkage) and another with low LD and high recombination. The method uses the classification of several SNP-block pairs in a region to group chromosomes into clusters with different recombination histories. The package takes phased genotype data as input.

MADloy

R package for the robust detection of mosaic loss of chromosome Y (LOY) events from genotype-array intensity (SNP array) data and NGS data.

RD-connect

Integrated platform connecting databases, registries, biobanks and clinical bioinformatics for rare disease research.

TADbit

Computational framework to analyze and model 3C-based experiments.

3DID database of 3D interacting domains

Collection of protein interactions for which high-resolution three-dimensional structures are known. The interface residues are presented for each interaction type individually, plus global domain interfaces at which one or more partners (domains or peptides) bind. The web server visualizes these interfaces along with atomic details of individual interactions using Jmol.

scoreInvHap

Determines the inversion status of samples for known inversions. It uses SNP data as input and requires the following information about the inversion: genotype frequencies in the different haplotypes, R2 between the region SNPs and inversion status, and heterozygote genotypes in the reference. The package includes these data for two well-known inversions and for two additional regions.

inveRsion

Package to find genetic inversions in genotype (SNP array) data.

Benchmarking

OpenEBench

OpenEBench is a platform designed to support scientific communities such as VEIS that come together to address the challenges in their domain and decide to carry out scientific benchmarking of their methods and tools. In this way, OpenEBench benefits not only those communities but also other scientists, by offering a reference site where these data can be accessed and informed decisions can be made. This includes both software developers, who can develop more efficient methods by comparing their results with those of similar resources, and researchers, who often struggle to choose the right tool for the problem they want to solve.