EGA - European Genome-Phenome Archive

EGA

The EGA provides a service for the permanent archiving and distribution of personally identifiable genetic and phenotypic data resulting from biomedical research projects. Data at the EGA are collected from individuals whose consent agreements authorise the release of their data only for specific research use by certified researchers. Strict protocols govern how the EGA project manages, stores and distributes information. The European Genome-Phenome Archive (EGA) is jointly managed by the European Bioinformatics Institute (EBI) and the Centre for Genomic Regulation (CRG). Studies and datasets can be browsed on its public website. Each study is assigned a stable identifier that can be referenced in publications. The EGA implements a controlled-access policy under which access decisions rest with the Data Access Committee (DAC). The EGA only accepts anonymised data with an access plan approved by the DAC. Accepted data types include raw data formats from array-based or next-generation sequencing platforms, as well as phenotype files describing study samples.

Tool connection platforms

VRE

Virtual research environments (VREs) are online analysis platforms that give researchers integrated, transparent access to data repositories, analysis tools and visualisers. openVRE is an open framework for rapidly prototyping such computational platforms on top of a cloud infrastructure. The framework makes it easy to integrate analysis tools in a modular way, so as to offer researchers a consolidated catalogue of tools. The VEIS analysis methods are a starting point for building this catalogue. openVRE also facilitates access to external data repositories. The European Genome-Phenome Archive (EGA) is one of the data infrastructures that openVRE-based platforms can access securely and efficiently. The researcher's private genomic and phenotypic data are accessed and decrypted only when they are consumed by one of the analysis tools integrated in the system.

WfExS

WfExS-backend is a command-line program that consumes and creates RO-Crates for the high-level execution of workflows, in a human data analysis scenario centred on interconnecting research infrastructures that hold sensitive data. WfExS-backend delegates workflow execution to existing workflow engines, and is designed to make executions more secure and reproducible, fostering the reproducibility and replicability of analyses. Secure executions are achieved by using FUSE-based encrypted directories to store input files with non-disclosable content, intermediate results of the workflow execution and output files. RO-Crates are themselves a vehicle for transferring knowledge between different executions of a workflow. WfExS-backend stores all the gathered details, output metadata and the execution trace in the output RO-Crate so that future executions can be reproduced. The final results of an execution can be encrypted with crypt4gh, a GA4GH standard, using the public keys of the recipient researchers or of the destination system, so that the results can be moved outside the execution environments over insecure networks and storage.
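As a rough illustration of what an RO-Crate metadata document looks like, the following Python sketch builds a skeleton `ro-crate-metadata.json` following the general layout of the RO-Crate 1.1 specification. It does not reproduce what WfExS-backend actually writes: the function name, the file names and the field values are hypothetical, and a complete crate would need further properties (for example a license) as well as the workflow and provenance entities.

```python
import json


def minimal_ro_crate(name, description, date_published, files):
    """Build a skeleton RO-Crate 1.1 metadata document as a Python dict.

    All arguments are illustrative placeholders; this is not WfExS-backend output.
    """
    graph = [
        # Metadata file descriptor: points at the root dataset of the crate.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset: describes the crate as a whole and lists its parts.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    # One File entity per payload file referenced from the root dataset.
    graph.extend({"@id": path, "@type": "File"} for path in files)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Hypothetical example: an execution that produced two output files.
crate = minimal_ro_crate(
    name="Example workflow execution",
    description="Outputs and trace of a hypothetical workflow run",
    date_published="2024-01-01",
    files=["outputs/result.tsv", "trace/execution.log"],
)
crate_json = json.dumps(crate, indent=2)
```

Writing `crate_json` to a file named `ro-crate-metadata.json` at the top of the output directory is what would turn that directory into a crate.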

RD-Connect Genome-Phenome Analysis Platform

RD-Connect GPAP is a tool developed by CNAG-CRG and recognised by IRDiRC for securely sharing and analysing data. The platform provides user-friendly tools for clinicians and clinical researchers to analyse pseudonymised genome and exome sequencing data from rare disease patients, integrated with clinical and phenotypic information, for diagnosis and gene discovery. Data can also be discovered and shared with other authorised users of the system and with the wider scientific community (the system is integrated with the Beacon Network and Matchmaker Exchange) in a secure, controlled-access, privacy-preserving environment (GDPR-compliant). RD-Connect GPAP currently hosts phenotypic data and processed genomic data (annotated genetic variants) from more than 20,000 patients and relatives, and has been used to diagnose hundreds of patients. The corresponding raw (unprocessed) files, together with the phenotypic information, are deposited at the EGA, under the submitter's authorisation, for long-term controlled access and data reuse.

Rbbt

Rbbt, which stands for “Ruby Bioinformatics Toolkit”, is a framework for software development in bioinformatics. This comprehensive solution for genomics provides tools that underpin most bioinformatics tasks: parsing and tidying data; gathering and configuring resources such as software tools and databases; orchestrating the sequential production of results for reusable and reproducible work; producing shareable reports; and packaging interoperable functionality into pluggable modules. Many of Rbbt's features are organised around four major subsystems: workflows, TSV files, resource management, and HTML and REST. The Rbbt framework provides incentives to adhere to several reasonable standards that improve reusability and interoperability.

UseGalaxy

The Galaxy project is a workflow engine whose purpose is to make computational biology accessible to research scientists who have no experience in programming or systems administration. Galaxy was originally developed for genomic data analysis, but has evolved into a domain-agnostic platform used by researchers in cheminformatics, climate prediction and other fields. To ensure reproducibility, the platform captures the input parameters, the execution order of the steps in each workflow and the datasets (both intermediate and final), and allows the results of tool and workflow executions to be shared publicly or with individual users. In addition, Galaxy supports uploading data from different sources (the user's computer, a URL or various online resources). Within the framework of the VEIS project, all these tasks can be carried out on the platform, since it integrates a large number of tools that can be interconnected into different workflows to process data from hospitals and other research centres.

Barcelona Computational Biomedical Platform (BCBP)

rexposome

Package for exploring the exposome and performing association analyses between exposures and health outcomes.

CTDquerier

Comparative Toxicogenomics Database data extraction, visualization and enrichment of environmental and toxicological studies.

RD-Connect Genome-Phenome Analysis Platform (GPAP)

An online tool for diagnosis and gene discovery in rare disease research. The platform features allow identifying disease-causing mutations in rare disease patients and linking them with detailed clinical information.

APPRIS

Annotates variants with biological data such as protein structural information, functionally important residues, conservation of functional domains and evidence of cross-species conservation.

R-GADA

R implementation of GADA (Genetic Alteration Detector Algorithm) - used to detect CNVs from aCGH and intensity array SNP data

PanorOmics

The massive molecular profiling of thousands of cancer patients has led to the identification of many tumor type specific driver genes. However, only a few (or none) of them are present in each individual tumor and, to enable precision oncology, we need to interpret the alterations found in a single patient. Cancer PanorOmics (http://panoromics.irbbarcelona.org) is a web-based resource to contextualize genomic variations detected in a personal cancer genome within the body of clinical and scientific evidence available for 26 tumor types, offering complementary cohort and patient-centric views. Additionally, it explores the cellular environment of mutations by mapping them on the human interactome and providing quasiatomic structural details, whenever available. This ‘PanorOmic’ molecular view of individual tumors should contribute to the identification of actionable alterations ultimately guiding the clinical decision-making process.

OncogenomicLandscapes

The widespread incorporation of next-generation sequencing into clinical oncology has yielded an unprecedented amount of molecular data from thousands of patients. A major current challenge is to find reliable ways to extrapolate results from one group of patients to another and to interpret individual cases in the light of what is known from the cohorts. We present OncoGenomic Landscapes, a framework to analyze and display thousands of cancer genomic profiles in a 2D space. Our tool allows users to rapidly assess the heterogeneity of large cohorts, enabling comparison with other groups of patients, and uses driver genes as landmarks to aid in the interpretation of the landscapes. In our web server, we also offer the possibility of mapping new samples and cohorts onto 22 predefined landscapes related to cancer cell line panels, organoids, patient-derived xenografts, and clinical tumor samples.

Interactome3D

Network-centered approaches are increasingly used to understand the fundamentals of biology. However, the molecular details contained in the interaction networks, often necessary to understand cellular processes, are very limited, and the experimental difficulties surrounding the determination of protein complex structures make computational modeling techniques paramount. Interactome3D is a resource for the structural annotation and modeling of protein-protein interactions. Through the integration of interaction data from the main pathway repositories, we provide structural details at atomic resolution for over 12,000 protein-protein interactions in eight model organisms. Unlike static databases, Interactome3D also allows biologists to upload newly discovered interactions and pathways in any species, select the best combination of structural templates and build three-dimensional models in a fully automated manner.

dSysMap

Considering pathological genetic variants within the context of the human interactome network can help in understanding the intricate genotype-to-phenotype relationships behind human diseases. It makes it possible, for instance, to distinguish changes that totally suppress a gene product (i.e. node removal) from those that might affect only one of its functions, modulating the way in which the protein interacts with its partners (i.e. edge-specific or edgetic). dSysMap is a resource for the systematic mapping of disease-related missense mutations on the structurally annotated binary human interactome from Interactome3D.

Chemical Checker

The Chemical Checker (CC) is a resource that provides processed, harmonized and integrated bioactivity data on 800,000 small molecules. The CC divides data into five levels of increasing complexity, ranging from the chemical properties of compounds to their clinical outcomes. In between, it considers targets, off-targets, perturbed biological networks and several cell-based assays such as gene expression, growth inhibition and morphological profilings. In the CC, bioactivity data are expressed in a vector format, which naturally extends the notion of chemical similarity between compounds to similarities between bioactivity signatures of different kinds.
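To illustrate how a vector representation extends the idea of chemical similarity to bioactivity signatures, here is a minimal Python sketch. It is not the Chemical Checker API: the compound names, the signature values and the helper functions are hypothetical, and cosine similarity is simply one plausible metric chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query, signatures):
    """Name of the compound whose signature is closest to the query compound."""
    others = (name for name in signatures if name != query)
    return max(others, key=lambda name: cosine_similarity(signatures[query], signatures[name]))


# Made-up signature vectors for one bioactivity level (not real CC data).
signatures = {
    "compound_A": [0.9, 0.1, -0.3, 0.4],
    "compound_B": [0.8, 0.2, -0.2, 0.5],
    "compound_C": [-0.7, 0.6, 0.9, -0.1],
}

nearest_to_a = most_similar("compound_A", signatures)
```

The same comparison can be repeated on signatures from a different level (for example targets instead of chemistry), which is what lets two compounds be similar in one sense and dissimilar in another.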

DisGeNET

DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases. DisGeNET integrates data from expert curated repositories, GWAS catalogues, animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype–phenotype relationships. The current version of DisGeNET (v7.0) contains 1,134,942 gene-disease associations (GDAs), between 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369,554 variant-disease associations (VDAs), between 194,515 variants and 14,155 diseases, traits, and phenotypes.

nf-core-viralrecon

nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports short-read Illumina sequencing data from both shotgun (e.g. sequencing directly from clinical samples) and enrichment-based library preparation methods (e.g. amplicon-based: ARTIC SARS-CoV-2 enrichment protocol; or probe-capture-based).

BoostDM

BoostDM is a method to score all possible point mutations (single base substitutions) in cancer genes for their potential to be involved in tumorigenesis. The method is based on the analysis of observed mutations in sequenced tumors and their site-by-site annotation with relevant features. The compendium of cancer genes and the mutational features for each cancer gene across malignancies have been derived from the systematic analysis of tens of thousands of tumor samples (www.intogen.org). Other relevant features have been collected from public databases.

Intogen

IntOGen is a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients. The framework identifies cancer genes and pinpoints their putative mechanism of action across tumor types.

Cancer Genome Interpreter (CGI)

Cancer Genome Interpreter (CGI) is designed to support the identification of tumor alterations that drive the disease and detect those that may be therapeutically actionable. CGI relies on existing knowledge collected from several resources and on computational methods that annotate the alterations in a tumor according to distinct levels of evidence.

UEP

An open-source and fast classifier for predicting the impact of mutations in protein–protein complexes

sqtlseeker2-nf

Nextflow pipeline for splicing quantitative trait loci (sQTL) mapping based on sQTLseekeR2. The pipeline performs the following analysis steps: 1) Index the genotype file, 2) Preprocess the transcript expression data, 3) Test for association between splicing ratios (a multivariate phenotype) and genetic variants in cis (nominal pass), 4) Obtain an empirical P-value for each phenotype (permutation pass, optional), 5) Control for multiple testing.
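Step 5 can be illustrated with a generic false discovery rate correction. The sketch below implements the Benjamini-Hochberg procedure in plain Python. It is an assumption that this style of correction is a reasonable stand-in here; the function name and the P-values are hypothetical and are not taken from sQTLseekeR2 or the pipeline.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical empirical P-values, one per tested phenotype.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)

# Phenotypes called significant at a 5% false discovery rate.
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

Each adjusted value is the smallest false discovery rate at which that test would be called significant, so thresholding the adjusted values is equivalent to applying the procedure at that rate.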

ggsashimi

Command-line tool for the visualization of splicing events across multiple samples

FoldX

FOLD-X is a program for calculating the folding energies of proteins and for calculating the effect of a point mutation on the stability of a protein.

Nextflow

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

PyHIST

PyHIST is a Histological Image Segmentation Tool: a lightweight semi-automatic pipeline to extract tiles with foreground content from SVS histopathology whole image slides (with experimental support for other formats). It is intended to be an easy-to-use tool to preprocess histological image data for usage in machine learning tasks. The PyHIST pipeline involves three main steps: 1) produce a mask for the input WSI that differentiates the tissue from the background, 2) create a grid of tiles on top of the mask, evaluate each tile to see if it meets the minimum content threshold to be considered as foreground and 3) extract the selected tiles from the input WSI at the requested resolution.
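The second step, laying a grid over the mask and keeping tiles with enough tissue, can be sketched in a few lines of Python. This is an illustration of the idea only, not PyHIST's implementation: the function name, the nested-list mask representation, the default threshold and the choice to drop partial tiles at the image border are all assumptions.

```python
def select_tiles(mask, tile_size, min_content=0.5):
    """Return (top, left) corners of tiles whose tissue fraction meets the threshold.

    mask is a list of rows of 0/1 values, where 1 marks tissue and 0 background.
    Partial tiles at the right and bottom edges are skipped (an assumption).
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    selected = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            # Count tissue pixels inside the current tile.
            tissue = sum(
                mask[row][col]
                for row in range(top, top + tile_size)
                for col in range(left, left + tile_size)
            )
            if tissue / (tile_size * tile_size) >= min_content:
                selected.append((top, left))
    return selected


# Toy 4x4 mask: tissue in the top-left and bottom-right corners.
toy_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

half_full = select_tiles(toy_mask, tile_size=2, min_content=0.5)
completely_full = select_tiles(toy_mask, tile_size=2, min_content=1.0)
```

The returned coordinates are in mask space; extracting the tiles from the slide in step 3 would require scaling them to the requested resolution.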

PMut

Web portal for the annotation of pathological protein variants.

epico-data-analysis-portal

EPICO / BLUEPRINT Data Analysis Portal, data loading scripts and API

MAD

Mosaic Alteration Detector

invClust

invClust is a method to detect inversion-related haplotypes in SNP data

Rbbt

Rbbt stands for “Ruby Bioinformatics Toolkit”. It is a framework for software development in bioinformatics. It covers three aspects: a workflow wrapping the TF text-mining results from https://github.com/fnl plus other databases; High Throughput Sequencing related functionalities; and an Rbbt wrapper for the Variant Effect Predictor, which auto-downloads and installs the software.

Rbbt-HTS

Workflow for variant calling and other functionalities

nf-core-chipseq

nf-core/chipseq is a bioinformatics analysis pipeline used for Chromatin ImmunopreciPitation sequencing (ChIP-seq) data.

nf-core-rnaseq

nf-core/rnaseq is a bioinformatics analysis pipeline used for RNA sequencing data.

nf-core-smrnaseq

nf-core/smrnaseq is a bioinformatics best-practice analysis pipeline used for small RNA sequencing data.

nf-core-atacseq

nf-core/atacseq is a bioinformatics analysis pipeline used for ATAC-seq data.

nf-core-cageseq

nf-core/cageseq is a bioinformatics analysis pipeline used for CAGE-seq sequencing data.

nf-core-methylseq

nf-core/methylseq is a bioinformatics analysis pipeline used for Methylation (Bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.

nf-core-hic

This pipeline is based on the HiC-Pro workflow. It was designed to process Hi-C data from raw FastQ files (paired-end Illumina data) to normalized contact maps. The current version supports most protocols, including digestion protocols as well as protocols that do not require restriction enzymes such as DNase Hi-C. In practice, this workflow has been successfully applied to many datasets including dilution Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C and HiChIP data.

MuGVRE

The MuG Virtual Research Environment is an analysis platform for 3D/4D genomics analyses. It integrates genomics tools for chromatin dynamics data.

methylclock

Estimate chronological and gestational DNA methylation (DNAm) age as well as biological age using different methylation clocks

recombClust

recombClust is an R package that groups chromosomes by their recombination history. Recombination history is based on a mixture model that, given a pair of SNP-blocks, separates chromosomes into two populations, one with high Linkage Disequilibrium (LD) and low recombination (linkage) and another with low LD and high recombination. The method uses the classification of several SNP-block pairs in a region to group chromosomes into clusters with different recombination histories. This package takes phased genotype data as input.

MADloy

R package for the robust detection of mosaic loss of chromosome Y (LOY) events from SNP array (genotype-array intensity) and NGS data.

RD-connect

Integrated platform connecting databases, registries, biobanks and clinical bioinformatics for rare disease research.

TADbit

Computational framework to analyze and model 3C-based experiments.

3DID database of 3D interacting domains

Collection of protein interactions for which high-resolution three-dimensional structures are known. The interface residues are presented for each interaction type individually, plus global domain interfaces at which one or more partners (domains or peptides) bind. The web server visualizes these interfaces along with atomic details of individual interactions using Jmol.

scoreInvHap

Infers the inversion status of samples for known inversions. It uses SNP data as input and requires the following information about the inversion: genotype frequencies in the different haplotypes, R2 between the region SNPs and inversion status, and heterozygote genotypes in the reference. The package includes this data for two well-known inversions and for two additional regions.

inveRsion

Package to find genetic inversions in genotype (SNP array) data.

Benchmarking

OpenEBench

OpenEBench is a platform designed to support scientific communities such as VEIS that come together to tackle the challenges in their domain and decide to carry out scientific benchmarking of their methods and tools. In this way OpenEBench benefits not only those communities but also other scientists, by offering a reference site where these data can be accessed and informed decisions can be made. This includes both software developers, who can develop more efficient methods by comparing their results with those of similar resources, and researchers, who often struggle to choose the right tool for the problem they want to solve.

OpenEBench Software Observatory

The OpenEBench Software Observatory aims to be an instrument for the systematic observation and diagnosis of research software in the Life Sciences. It promotes the adoption of good software development practices by identifying trends in how research software is being developed. To help identify needs and strategies to adopt, VEIS has been added as a new community in the Software Observatory. All the tools that are part of the Barcelona Computational Biomedical Platform (BCBP) are displayed in the Software Observatory, making it easier to understand the quality of the software within the consortium.