As researchers learn more about the role of gut bacteria in modulating human health and disease, a major question is how broadly these conclusions can be applied among global populations with very different gut microbiomes. Importantly, many world regions are underrepresented in the microbiome literature, which is dominated by high-income countries such as the United States. However, in the highly complex environment of the human gut, understanding the impact of these imbalances can be challenging without access to very large sample sizes.
A new study by researchers in the laboratory of Ran Blekhman, PhD, at the University of Chicago and the University of Minnesota offers an innovative approach to this problem. In a paper published on January 22, 2025, in Cell, they present the largest publicly available dataset of its kind, the Human Microbiome Compendium. The team built it by uniformly processing more than 160,000 gut microbiome samples from 68 nations – the first time this type of data has been processed at this scale.
“The purpose of our dataset is to take already available studies of different microbiomes from all around the world, from people who are young and old and sick and healthy, and integrate them into one uniform dataset,” said co-lead author Samantha Graham, a graduate student at the University of Minnesota. “The utility is that we’re able to find patterns that we wouldn’t have been able to find previously.”
16S amplicon sequencing – a method of identifying microorganisms based on a region of their genomes – has long been a mainstay of microbiology due to its ease of use and low cost. While the baseline steps of 16S sequencing and analysis are similar study-to-study, every lab collects and processes its data differently, making it difficult to combine results from multiple projects.
In the new study, instead of relying on summary tables from disparate papers, researchers took on the daunting task of re-processing the raw data from 168,464 samples identically.
On the surface, this may seem like a straightforward, if time-consuming, process of revisiting hundreds of individual projects. But in practice, the researchers found that some of the most challenging technical issues stemmed from choices made by the authors of the original studies.
“Every researcher provides their sample data in their own way. Sometimes it's completely absent. Sometimes it's very, very thorough. But most of the time, it’s in very different formats,” said study author Richard Abdill, PhD, the lab manager for Blekhman’s group.
Once the team was able to organize this massive amount of data, the power afforded by such a high sample count allowed them to ask new questions.
Some world regions, such as the United States, have been sampled so extensively that repeated sampling brings few surprises. Other world regions, such as Northern Africa and Western Asia, have had very few studies with publicly available data. The researchers demonstrated how adding just a few samples uncovers much more information about the identity and abundance of microorganisms in these regions, a phenomenon termed the discovery rate.
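For readers who want to see the idea concretely, the discovery rate can be pictured as an accumulation curve: count how many distinct microbes have been seen after each additional sample. The Python sketch below is only an illustration and is not the method used in the study; the function names and the toy taxon labels are invented for this example.

```python
def discovery_curve(samples):
    """Cumulative count of distinct taxa after each successive sample."""
    seen = set()
    curve = []
    for taxa in samples:
        seen.update(taxa)        # add any taxa not seen before
        curve.append(len(seen))  # running total of distinct taxa
    return curve


def new_taxa_per_sample(samples):
    """Number of previously unseen taxa contributed by each sample."""
    gains = []
    previous = 0
    for total in discovery_curve(samples):
        gains.append(total - previous)
        previous = total
    return gains


# Toy data (invented, not from the study): each set lists the taxa in one sample.
well_sampled = [{"taxon_A", "taxon_B"}, {"taxon_A"}, {"taxon_B"}, {"taxon_A", "taxon_B"}]
under_sampled = [{"taxon_C"}, {"taxon_D", "taxon_E"}, {"taxon_F"}, {"taxon_C", "taxon_G"}]

print(new_taxa_per_sample(well_sampled))   # [2, 0, 0, 0]
print(new_taxa_per_sample(under_sampled))  # [1, 2, 1, 1]
```

In the toy well-sampled region, samples after the first add nothing new, while every sample from the toy under-sampled region contributes at least one taxon not seen before.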
“From a dollars and cents perspective, you’re going to learn more if you look at microbiomes that haven't been studied as extensively,” said Graham.
Further, the authors identified potential issues in generalizing microbiome patterns to other populations. For example, the microbes Bacteroides, Bifidobacterium and Prevotella have well-studied roles in conditions such as obesity and inflammatory bowel disease. However, the researchers showed that many of these organisms have very different patterns of abundance in world regions outside of Europe and North America, calling into question the applicability of findings from Western-centric studies to global populations.
The researchers hope that the Human Microbiome Compendium will serve as a resource to others. They noted that such a large sample size would benefit machine learning applications tuned to pick up complex patterns relevant for human health and disease.
They also hope their findings provide perspective for global microbiome research.
“While the larger problem of lopsided representation still exists, we were really excited to be able to quantify the extent of that issue. To try to understand what the implications might be. We have made a big step in that direction,” said Abdill.
This work is supported by NIH grant R01LM013863.