UZIMA-DS Project

UZIMA-DS: Utilizing Health Information for Meaningful Impact in East Africa through Data Science

Project Period: 2021 – 2026

Funder: National Institutes of Health

Collaborators: Aga Khan University; University of Michigan; Dalhousie University; Ottawa University; Kenya Medical Research Institute-Wellcome Trust Research Programme; Clinton Health Access Initiative

Overall Project Summary: The UZIMA-DS (UtiliZing Health Information for Meaningful Impact in East Africa through Data Science) Hub aims to create a scalable and sustainable platform that applies novel approaches to data assimilation and advanced artificial intelligence/machine learning (AI/ML) methods to serve as early warning systems for critical health issues affecting Africans in two domains: maternal, newborn and child health, and mental health.

UZIMA-DS brings together methods experts in statistics, computer science, and informatics; health domain experts and practitioners; and partnerships with key stakeholders, to improve not only the quality, efficiency, and relevance of multidisciplinary data science in health research, but also its transparency, reproducibility, and dissemination for sustainable impact in Africa. This helps ensure that current and future generations of Africans can achieve uzima (Swahili for health/well-being).

Ultimately, the UZIMA-DS hub will develop a scalable and sustainable platform characterised by:

  •  Harmonization of multimodal data sources for meaningful use and analysis,
  •  Leveraging temporal patterns in data to identify trajectories through prediction modelling using AI/ML-based methods, and
  •  Engaging with key stakeholders to identify pathways for disseminating and sustaining these models in target communities.

Data Management and Access Core (DMAC)

The Data Innovation team plays a vital role in supporting the data infrastructure needs of the UZIMA-DS project. Operating within a cloud-first, open-source environment, the UZIMA-DS architecture undergoes continuous iteration, incorporating emerging data algorithms, best practices, and standards. The team's primary responsibility is to ensure that all data required by researchers is processed and cleansed through high-quality data pipelines and then organised within a data model optimised for analysis. The team also trains and supports researchers in accessing cloud resources and using tools effectively. In addition, it ensures compliance with Kenyan data protection laws and serves as the primary contact with the Office of the Data Protection Commissioner.

We are part of the Data Management and Access Core for the UZIMA-DS Hub. Our work involves facilitating and supporting effective data management and analysis, guided by the FAIR (Findable, Accessible, Interoperable, Reusable) principles, for the UZIMA-DS Research Hub, collaborators across the DS-I Africa consortium, and the DS-I Africa Coordinating Centre. The DMAC addresses the following objectives:

  • Support the Research Hub’s data ecosystem through the development and maintenance of data quality assurance measures and standards for statistical code sharing, data reproducibility, sharing, and interoperability,
  • Facilitate data analytics utilizing AI/ML methods and provide analytical support for the Hub’s research projects, and
  • Foster data sharing, interoperability, and metadata approaches across the greater DS-I Africa Consortium.

The long-term goal of the DMAC is to develop a pipeline of data support, data use, and data sharing capacity that facilitates high-quality research in East Africa and can serve as a model platform that is scalable, reproducible, and shareable.


Project Title: Sustainable Cloud Operations for Research and Environmental Impact (SCORE – EI)

Overview

Research projects in low- and middle-income countries (LMICs) encounter significant challenges with cloud utilization, such as limited resources, tight budgets, and a lack of expertise in cloud infrastructure. These obstacles hinder the ability to scale research collaborations within the DS-I Africa consortium and beyond. To address these issues, we propose developing and documenting good practices for building efficient and cost-effective cloud-based data pipelines for research. We will also address scaling cloud infrastructure responsibly, ensuring that as we expand, our carbon footprint does not increase proportionally. We will use data pipelines from the ongoing UZIMA-DS project, which processes daily Fitbit data from 500 healthcare workers on a Microsoft Azure instance. This use case will serve as a foundation for building and evaluating data pipelines using three tools and approaches within the Azure environment. Our primary goal is to develop and implement good practices for constructing efficient, cost-effective cloud-based data pipelines that minimize environmental impact. By doing so, we aim to promote sustainable and scalable research practices in LMIC settings, captured in our motto, “Efficient Data, Sustainable Future for Research.” The expected impact of this work includes enhanced research efficiency, cost savings, environmental sustainability, and a DS-I Africa consortium empowered to adopt sustainable cloud practices.

Background of the Project

The parent project, UZIMA-DS (UtiliZing Health Information for Meaningful Impact in East Africa through Data Science), aims to create a scalable and sustainable platform that leverages novel data assimilation approaches and advanced Artificial Intelligence (AI) and Machine Learning (ML) methods to improve health outcomes in two key domains: maternal, newborn, and child health (MNCH), and mental health (MH).

Research communities in low- and middle-income countries (LMICs) face significant resource constraints in cloud environments. These constraints include limited financial resources, insufficient access to high-performance computing infrastructure, and a lack of technical expertise to manage and optimize cloud resources effectively. Moreover, the environmental impact of cloud computing is an increasingly pressing issue. The energy consumption of processing and storing large volumes of high-frequency data contributes to a significant carbon footprint. This was among the emerging issues that Prof Keymanthri Moodley (Contact-PI, REDSSA project) presented during the 3rd Annual DS-I Africa consortium meeting. This issue is further exacerbated when resources and services are not optimized, resulting in wasted computational power and further complicating the challenge of conducting sustainable research.

Compounding these challenges is the lack of guides and good practices for designing optimal and efficient data pipelines. With clear guidelines, research teams can avoid suboptimal configurations that waste computational resources and raise operational costs. Such inefficiency strains limited budgets and increases energy consumption, and with it the carbon footprint. Balancing resource limitations and environmental sustainability while providing actionable good practices for data pipeline optimization is essential for empowering LMIC research communities to leverage advanced data science methodologies effectively and responsibly.

Overall Objective and Key Deliverables

Our goal is to develop and implement good practices for constructing efficient, cost-effective cloud-based data pipelines that minimize environmental impact. This project aims to document methodologies that optimize resource utilization, reduce operational costs, and lower the computational energy associated with data processing. By doing so, we aim to promote responsible and scalable research practices within the DS-I Africa projects and in LMIC settings, in line with our motto, “Efficient Data, Sustainable Future for Research.”

Objective 1: To assess and minimize the environmental impact of cloud computing in data-intensive research

Our main aim is to assess the carbon emissions associated with processing and storing large volumes of high-frequency data in cloud environments. Our key components include 1) measuring the carbon emissions of existing data processing workflows and 2) implementing and testing energy-efficient practices and methodologies to reduce those emissions.
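
As a rough illustration of the first component, the sketch below estimates the operational emissions of one pipeline run using a commonly used approximation: energy consumed, multiplied by the data centre's power usage effectiveness (PUE) and the carbon intensity of the local electricity grid. This is not the project's measurement method. The class name, function name, and every number are hypothetical placeholders; real values would have to come from cloud usage logs and from published provider or grid data.

```python
from dataclasses import dataclass


@dataclass
class PipelineRun:
    """Resource usage of one pipeline run (all fields are illustrative)."""
    name: str
    compute_hours: float    # wall-clock hours the compute resource was busy
    avg_power_watts: float  # ASSUMED average power draw of that resource


def estimate_emissions_kg(run: PipelineRun,
                          pue: float,
                          grid_intensity_kg_per_kwh: float) -> float:
    """Estimate operational emissions (kg CO2e) for a single run.

    energy (kWh) = hours * watts / 1000
    emissions    = energy * PUE * grid carbon intensity
    """
    energy_kwh = run.compute_hours * run.avg_power_watts / 1000.0
    return energy_kwh * pue * grid_intensity_kg_per_kwh


# Placeholder inputs only; none of these figures come from the project.
example_run = PipelineRun(name="daily_ingest",
                          compute_hours=2.0,
                          avg_power_watts=150.0)
example_kg = estimate_emissions_kg(example_run,
                                   pue=1.2,
                                   grid_intensity_kg_per_kwh=0.4)
print(f"{example_run.name}: {example_kg:.3f} kg CO2e per run")
```

Applying the same calculation to a workflow before and after an optimization gives a simple way to compare them, provided the same assumptions are used for both.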

Objective 2: To identify, develop, and implement good practices for cost-efficient and resource-optimized cloud-based data pipelines

We aim to develop a framework of good practices that optimize cost efficiency and resource utilization of cloud-based data pipelines in LMIC research settings. Our key components include 1) developing strategies to minimize financial expenditures on cloud resources, including optimizing the selection and usage of cloud services and 2) ensuring high-performance computing infrastructure is utilized effectively to handle large datasets, fine-tune computational processes, and leverage advanced data management techniques.
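
One way to picture the cost side of this objective is to compare candidate pipeline configurations on cost per run, subject to a runtime limit. The sketch below does that under simple assumptions: cost is runtime multiplied by an hourly rate, and the option names, runtimes, and rates are hypothetical placeholders rather than measurements of the tools evaluated in this project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PipelineOption:
    """One candidate pipeline configuration (illustrative values only)."""
    name: str
    runtime_hours: float    # time one run takes
    hourly_rate_usd: float  # ASSUMED price of the underlying resource

    @property
    def cost_per_run_usd(self) -> float:
        # Simplification: ignores storage, data transfer, and idle time.
        return self.runtime_hours * self.hourly_rate_usd


def cheapest_within_deadline(options: List[PipelineOption],
                             max_runtime_hours: float) -> Optional[PipelineOption]:
    """Return the lowest-cost option that finishes within the limit, or None."""
    feasible = [o for o in options if o.runtime_hours <= max_runtime_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o.cost_per_run_usd)


# Hypothetical candidates; real figures would come from billing exports.
candidates = [
    PipelineOption("option_a", runtime_hours=1.0, hourly_rate_usd=0.50),
    PipelineOption("option_b", runtime_hours=0.25, hourly_rate_usd=1.20),
    PipelineOption("option_c", runtime_hours=3.0, hourly_rate_usd=0.08),
]
choice = cheapest_within_deadline(candidates, max_runtime_hours=2.0)
print(choice.name if choice else "no option meets the runtime limit")
```

In this example the slowest option is the cheapest overall but misses the two-hour limit, which shows why cost and performance need to be weighed together rather than separately.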

Expected Results/Impact

By developing and applying effective practices for cost-efficient and resource-optimized cloud-based data pipelines, we aim to enhance the capabilities of researchers, enabling them to use advanced data science methodologies for impactful research outcomes. Beyond the immediate benefits to research, this project advocates for sustainable and efficient practices within the scientific community. This contribution aligns with global efforts to mitigate climate change and ensure a sustainable future for research endeavours. Through our focus on capacity building and the dissemination of good practices, we aim to stimulate continuous improvement and innovation in research practices, which in turn should drive advances in knowledge across key research areas.