IMS Laboratory for Large-Scale Biomedical Data Technology

Laboratory for Large-Scale Biomedical Data Technology, RIKEN Center for Integrative Medical Sciences (IMS)

The ever-transforming FANTOM Project:
Building a database to fully utilize rapidly increasing DNA/transcription data

Functional Annotation of the Mammalian Genome (FANTOM) is an international research consortium that publishes information on transcription in mammals, primarily humans and mice. The FANTOM Project data management team, led by Takeya Kasukawa, assembles, organizes, and publishes the data produced and analyzed in the project. FANTOM keeps transforming, with each edition taking on a different analysis target; FANTOM6, the sixth edition of the project, is currently underway. The data provided by FANTOM and its offshoot projects are a source of ideas for researchers worldwide.

Takeya Kasukawa
Team Leader
Laboratory for Large-Scale Biomedical Data Technology, RIKEN IMS

An ever-transforming transcription database

Researching genes and transcription without information from databases has become difficult. We have been working on the FANTOM Project, which acquires, analyzes, organizes, and provides large-scale data on transcription. The project began in 2000 and is in its sixth edition as of 2023. With each edition, we have set a new target and provided relevant data.

Fig. 1: Transformations of the FANTOM Project
Many browsers have been released to show project data in a format that researchers can utilize easily.

Before the genome was fully sequenced, research frequently used DNA complementary to mRNA (cDNA). Experiments were conducted to determine which cDNA coded for which protein; this information was attached to the gene sequences (functional annotation), which were then published. With each of the first three editions of FANTOM, the scale grew larger; FANTOM3 organized information on roughly 100,000 mouse cDNAs and published it under the name “RIKEN cDNA annotation viewer”.

During FANTOM3, an experimental technique called Cap Analysis of Gene Expression (CAGE) was developed at RIKEN. This technique enabled efficient, comprehensive measurement of the activity of transcription start sites and the regions upstream of them (promoters). Consequently, transcriptional regulatory networks could also be visualized. FANTOM3 through FANTOM5 analyzed this transcriptional regulation, combined the results with other information such as transcription start sites, organized it, and published it.

Ultimately, FANTOM5 assembled data for 3,000 samples from humans, mice, and rats, and amassed data on transcription start sites and their activity in various cells (Table 1). Data collection on such a large scale is almost unheard of anywhere in the world. This data is also utilized as source data in other databases.


Table 1: Volume of data in FANTOM5
Sequencing results not only from CAGE but also from RNA-Seq and miRNA-Seq were processed, and the resulting activity data was organized. The FANTOM Consortium, which is centered on RIKEN, performed all experiments and obtained the data.

FANTOM6, which is in progress as of 2023, seeks to determine the function of long noncoding RNA (lncRNA). From the lncRNAs identified in FANTOM5, FANTOM6 has selected 300 lncRNAs to be analyzed from perspectives such as primary cells and relationships to diseases. By using CAGE to analyze how gene expression changes when these lncRNAs are knocked down, FANTOM6 is inferring the function of lncRNA. In some cases, knocking down a certain lncRNA changes overall expression, which strongly suggests that lncRNA serves an important function.
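The idea of comparing expression before and after a knockdown can be illustrated with a minimal sketch. It assumes, purely for illustration, that CAGE tag counts per gene are available for a control sample and a knockdown sample. The function name, gene names, counts, and pseudocount are hypothetical; this is not the FANTOM6 analysis pipeline, which is not described here.

```python
import math


def log2_fold_changes(control, knockdown, pseudocount=1.0):
    """Return log2(knockdown / control) for genes present in both samples.

    Counts are first normalized to tags per million (TPM) so that samples
    sequenced to different depths are comparable. The pseudocount avoids
    taking the logarithm of zero.
    """
    control_total = sum(control.values())
    knockdown_total = sum(knockdown.values())
    changes = {}
    for gene in control.keys() & knockdown.keys():
        control_tpm = control[gene] / control_total * 1e6
        knockdown_tpm = knockdown[gene] / knockdown_total * 1e6
        changes[gene] = math.log2(
            (knockdown_tpm + pseudocount) / (control_tpm + pseudocount)
        )
    return changes


# Hypothetical tag counts (illustrative only, not FANTOM data).
control = {"GENE_A": 500, "GENE_B": 300, "GENE_C": 200}
knockdown = {"GENE_A": 250, "GENE_B": 300, "GENE_C": 450}

changes = log2_fold_changes(control, knockdown)
for gene in sorted(changes):
    print(gene, round(changes[gene], 2))
```

A negative value means expression fell after the knockdown, a positive value means it rose, and a value near zero means little change. A real analysis would also need replicates and a statistical test before drawing conclusions.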

Developing a browser to present FANTOM data in an optimal manner

A concerted effort has also been made to present FANTOM data in a form that researchers can utilize easily (Fig. 1). Users can look up genomic coordinates using ZENBU, a browser associated with FANTOM5. When users look up a given gene, they can view all the information that FANTOM5 has examined for that region. This functionality would be highly useful, for example, in researching diseases caused by genetic mutations. SSTAR, a browser for users interested in genes, enables users to learn about the data and analysis results obtained in FANTOM5.
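A coordinate-based lookup of this kind comes down to finding every annotated feature that overlaps a queried region. The following is a minimal sketch of that overlap test only. The Feature class, the function name, and the sample records are hypothetical, and the code does not use or represent the ZENBU or SSTAR interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    label: str


def features_in_region(features, chrom, start, end):
    """Return features on `chrom` that overlap the half-open interval [start, end)."""
    return [
        f for f in features
        if f.chrom == chrom and f.start < end and f.end > start
    ]


# Hypothetical annotation records (illustrative only).
features = [
    Feature("chr1", 100, 200, "promoter_1"),
    Feature("chr1", 150, 400, "transcript_1"),
    Feature("chr1", 900, 950, "enhancer_1"),
    Feature("chr2", 100, 200, "promoter_2"),
]

hits = features_in_region(features, "chr1", 180, 300)
labels = [f.label for f in hits]
print(labels)
```

A linear scan is enough for a handful of records. At genome scale, a browser would more plausibly rely on an indexed structure such as an interval tree, which is an assumption here rather than a statement about how the FANTOM browsers are built.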


Fig. 2: Example of information viewed with the FANTOM5 browser SSTAR
Information on the sequence of the transcription factor SOX2 (upper) and six transcription start sites (lower).

The FANTOM CAT Browser, which presents gene data sets that integrate CAGE data with transcriptome data, is characterized by its wealth of ncRNA data. Other browsers include the FANTOM5 miRNA atlas, which allows users to search for miRNA information, and Cell Connectome Visualization, which enables users to look up the relationships between ligands and receptors and provides clues regarding interactions between cells.

FANTOM5 is also well known as a source of information for finding enhancer candidates. SlideBASE, a browser for looking up enhancer information, was built by processing FANTOM5 data.

FANTOM offshoot projects

Our team is also working on other projects, such as expanding FANTOM data and launching associated new databases. refTSS, which integrates FANTOM transcription start site data with similar data published by research institutions around the world, is a data set that can be used as reference data for analysis.
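To give a sense of what integrating transcription start site data from several sources can involve, here is a minimal sketch that merges nearby positions reported by different sources into shared clusters. The function name, the 10-base merging window, and the sample positions are all assumptions made for illustration; this is not the procedure used to build refTSS.

```python
def merge_tss(sources, max_gap=10):
    """Cluster TSS positions reported by several sources.

    `sources` maps a source name to a list of (chrom, strand, position)
    tuples. A position on the same chromosome and strand that lies within
    `max_gap` bases of the previous position joins that cluster.
    Returns a list of (chrom, strand, start, end, sorted source names).
    """
    records = sorted(
        (chrom, strand, pos, name)
        for name, sites in sources.items()
        for chrom, strand, pos in sites
    )
    clusters = []
    for chrom, strand, pos, name in records:
        last = clusters[-1] if clusters else None
        if (
            last is not None
            and last[0] == chrom
            and last[1] == strand
            and pos - last[3] <= max_gap
        ):
            last[3] = pos        # extend the cluster end
            last[4].add(name)    # remember which source supports it
        else:
            clusters.append([chrom, strand, pos, pos, {name}])
    return [(c, s, start, end, sorted(names)) for c, s, start, end, names in clusters]


# Hypothetical TSS positions from two sources (illustrative only).
sources = {
    "source_a": [("chr1", "+", 1000), ("chr1", "+", 5000)],
    "source_b": [("chr1", "+", 1004), ("chr1", "-", 1002)],
}

reference = merge_tss(sources)
for cluster in reference:
    print(cluster)
```

Keeping the list of supporting sources for each cluster lets a user see which start sites are confirmed independently and which come from a single data set.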

INTRARED is a database which integrates two transcriptional regulation-related databases in an attempt to increase their value. These two databases are: fanta.bio, which was created by RIKEN and the Tokyo Metropolitan Institute of Medical Science; and ChIP-Atlas, which was created by Kyoto University. INTRARED was created with the aim of enabling users to look up where transcription factors bind on the genome and what sort of transcription occurs.

The ever-increasing importance of information maintenance

Just as companies like Google, which hold large quantities of their own data, put effort into developing their next service, bioresearch now also requires the application of massive data. Based on the ideal of sharing data obtained through research with the world, more and more academic journals are requesting that data be published. However, data cannot be put to use if it is flawed in some way, or if the service that runs the browser has been discontinued even though the data itself is maintained. We research database construction and management every day to maintain powerful and valuable databases that support research worldwide.

(Article by: Kaori Oishi / Photo by: Tadashi Aizawa / Production assistance: Sci-Tech Communications)

RIKEN Open Life Science Platform