Government of India committed to fostering sustainable development through democratizing and disseminating this national genetic resource knowledge
The ‘GenomeIndia’ project, funded by the Department of Biotechnology of the Central Government, has completed a whole genome sequencing (WGS) database of over 10,000 individuals representing all major population groups across the country. GenomeIndia data represents the Government of India's commitment to scientific inquiry and is poised to reshape health and science in India and beyond, fostering sustainable development by democratizing and disseminating this national genetic resource.
The Department's cumulative, result-oriented actions (setting up the IBDC, releasing the Biotech PRIDE Guidelines, formulating the FeED Protocols, and transferring and storing GenomeIndia data at the IBDC), followed by announcements from the country's highest leadership, indicate the Government's strong determination to share this data with our researchers so that they can analyze critical information and accelerate discoveries and advancements in biological sciences.
For the first time in the country, the Department established the Indian Biological Data Center (IBDC) in March 2020. It has 96 TF of computing capacity using 2912 CPUs, 39 TB of RAM, 865 TF of computing capacity using 64 GPUs, a 4 PB parallel file system capable of writing 100 GB of data every second, and 1.5 PB of disk and tape to store backup copies of data. The Department released the Biotech-PRIDE Guidelines, 2021, and then formulated the ‘Framework for Exchange of Data (FeED) Protocols’ for responsible data sharing.
On 9th January 2025, during the ‘Genomics Data Conclave’, the 'GenomeIndia Data' was dedicated to researchers by Shri Narendra Modi, Prime Minister of India. The Prime Minister stated that this national database encapsulates the extraordinary genetic landscape of India and will serve as an invaluable scientific resource to boost genetic and medical research for human health. Further, during the address to the nation on the evening of 25th January 2025, Her Excellency Smt. Droupadi Murmu, President of India, said that the GenomeIndia project marks a significant chapter in the history of Indian science.
The Department also announced a ‘Call for Proposals’ inviting researchers to pursue translational research using GenomeIndia data. To address researchers' queries, the Department issued an addendum specifying the types and categories of data that will be available for research; “associated phenotype data” will also be shared. It is clarified that access to GenomeIndia data is not limited to the ‘Call’: independent requests for data access are also being received by IBDC, under the ambit of the Biotech PRIDE Guidelines and FeED Protocols.
As of date, this National Resource generated under the GenomeIndia project comprises FASTQ files of 9772 samples (~700 TB), gVCF files of 9772 samples (~35 TB), phenotypic data from 9330 samples, and joint call files (~3.5 TB). It is archived at IBDC, the National Repository.
Regarding the issue of phenotype data raised in a news article in a leading newspaper, it is stated that phenotypic data was curated and cleaned for the 9772 samples that underwent WGS and were used in joint calling. Of these 9772 samples, phenotypic data from 9330 samples could be used; the data available for the remaining 442 samples was not usable. Many phenotypic parameters had very high levels of missingness, so data for the top 27 variables for 9330 samples is available for research. These 27 variables are Albumin, Alkaline_Phosphatase, ALT_SGPT, AST_SGOT, Basophils, Cholesterol, Creatinine, Direct_Bilirubin, Eosinophils, FBS_Fasting_Blood_Glucose, HB_Haemoglobin, HbA1C_Glycosylated_Haemoglobin, HDL, Indirect_Bilirubin, LDL, Lymphocytes, MCH_Mean_Corpuscular_Hb, Monocytes, Neutrophils, Platelet_Count, Protein, RBC_Red_Blood_Cell_Count, RBS, Total_Bilirubin, Triglycerides, Urea and WBC_Total_White_Blood_Cell_Count. Anthropometry data (Age, Gender, Height, Weight and Body Fat) is also available.
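To illustrate how variables can be ranked by missingness, the following is a minimal sketch in Python. The statement above does not describe the actual curation procedure, so the ranking rule, the function names, the toy records and the variable `Hypothetical_Marker` are all assumptions made purely for illustration.

```python
from typing import Dict, List, Optional

# One record per sample; a value of None (or an absent key) means "missing".
Record = Dict[str, Optional[float]]


def missing_fraction(records: List[Record], variable: str) -> float:
    """Fraction of records in which `variable` is absent or None."""
    if not records:
        return 1.0  # no data at all: treat the variable as fully missing
    missing = sum(1 for r in records if r.get(variable) is None)
    return missing / len(records)


def rank_by_completeness(
    records: List[Record], variables: List[str], top_n: int
) -> List[str]:
    """Return the `top_n` variables with the lowest missing fraction.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(variables, key=lambda v: (missing_fraction(records, v), v))
    return ranked[:top_n]


# Toy data only (not GenomeIndia data); Hypothetical_Marker is invented.
toy_records: List[Record] = [
    {"Albumin": 4.1, "Urea": 30.0, "Hypothetical_Marker": None},
    {"Albumin": 3.9, "Urea": None, "Hypothetical_Marker": None},
    {"Albumin": 4.4, "Urea": 28.0, "Hypothetical_Marker": 1.2},
    {"Albumin": 4.0, "Urea": 25.0},
]
toy_variables = ["Albumin", "Urea", "Hypothetical_Marker"]

selected = rank_by_completeness(toy_records, toy_variables, top_n=2)
```

In this toy example, `Hypothetical_Marker` is missing in three of four records and is therefore ranked last, while `Albumin` and `Urea` are retained.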
Further, some news articles have raised concern about the ‘No Access’ status of FASTQ files. It is pertinent to mention that the total size of the FASTQ files is approximately 700 TB. The logistical and technical challenges of transferring these files are enormous, and it is difficult to ensure the completeness and integrity of downloads by requesters. Analyzing raw sequencing files often demands two to three times more computational capacity, leading to redundant workflows and wasted infrastructure at the national level. By providing equitable access to gVCF files (which amount to ~35 TB) instead, data can be shared more quickly and computational resources conserved. Leading international data banks that have been established for more than two decades also do not allow data to be downloaded; data is provided through their cloud platforms. Hence, ‘No Access’ to FASTQ files in the Department’s ‘Call’ means that these files will not be available for download at present. This policy is in line with other global consortia. As IBDC grows and expands in future, similar provisions may be incorporated.
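The scale of the transfer problem can be illustrated with a rough calculation. The sketch below uses the dataset sizes given above (~700 TB and ~35 TB); the 1 Gbit/s sustained link speed is an assumption chosen only for illustration, and real transfers would also face protocol overhead, retries and integrity checks.

```python
def transfer_days(size_tb: float, link_gbps: float) -> float:
    """Days needed to move `size_tb` decimal terabytes over a fully used link."""
    bits = size_tb * 1e12 * 8             # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)    # gigabits per second -> bits per second
    return seconds / 86_400               # seconds -> days


FASTQ_TB = 700.0          # approximate size from the statement above
GVCF_TB = 35.0            # approximate size from the statement above
ASSUMED_LINK_GBPS = 1.0   # assumption: sustained 1 Gbit/s per requester

fastq_days = transfer_days(FASTQ_TB, ASSUMED_LINK_GBPS)
gvcf_days = transfer_days(GVCF_TB, ASSUMED_LINK_GBPS)
size_ratio = FASTQ_TB / GVCF_TB

print(f"FASTQ: {fastq_days:.1f} days, gVCF: {gvcf_days:.1f} days, ratio: {size_ratio:.0f}x")
```

Under this assumed link speed, a single complete copy of the FASTQ files would take roughly 65 days of uninterrupted transfer, against a little over 3 days for the gVCF files, which are about 20 times smaller.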
Source: pib.gov.in