Even after so many years, I still remember one of my very first logistic regression projects. I was so proud of it. Until my supervisor kicked it back and told me that I had to weight it.
Weight it? What did that mean?
Little did I know that weighting is essential whenever a sample does not mirror the population it is meant to represent. If it is skipped when needed, the results will be incorrect.
Weighting
At first, I struggled to figure out how it worked and why I needed it. One day, over coffee, a fellow data scientist explained it in a way I have never forgotten.
He set his coffee cup down a fourth time, leaving another ring mark. He was working on what appeared to be the Olympic logo. Then he started.
Say you take a sample of 100 people from each state. The 39 million people in California are represented by 100 people, just as the 2 million people in Nebraska are.
Thus, each Californian in the sample should carry roughly 20 times the weight of each Nebraskan in the sample (39 million / 2 million ≈ 19.5).
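To make the arithmetic concrete, here is a minimal sketch of how such a base weight could be computed. It assumes the rounded populations quoted above and a sample of 100 people per state; the variable names are mine, chosen only for illustration.

```python
# Rounded state populations from the example above.
population = {"California": 39_000_000, "Nebraska": 2_000_000}
sample_size = 100  # people sampled per state

# Base weight: how many people each sampled person stands in for.
weights = {state: pop / sample_size for state, pop in population.items()}

# How much more a sampled Californian counts than a sampled Nebraskan.
ratio = weights["California"] / weights["Nebraska"]

print(weights)
print(f"ratio = {ratio:.1f}")
```

With these rounded figures, each sampled Californian stands in for 390,000 people and each sampled Nebraskan for 20,000, a ratio of 19.5.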
Running a logistic regression on that sample without weighting treats both states as if they were the same size and can yield grossly wrong results.
But it doesn’t stop there. Sampling a population is tricky.
There are multiple ways to weight a sample, and the right one depends on the questions you are asking.
Why Does NHANES Require Weighting?
The National Health and Nutrition Examination Survey (NHANES) is a CDC dataset that assesses the health and nutritional status of adults and children in the US.
I’ve used that data for many personal and professional projects. For anyone who wants to practice and sharpen their data science skills, it is a really good source.
NHANES deliberately oversamples certain groups (minorities, the elderly, low-income people) to get stable estimates for these subgroups and ensure they are represented.
Without weighting, the analysis treats every participant as representing the same number of people, whether a low-income Hispanic participant or a white suburban participant, so the oversampled groups count for more than their share of the population.
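A small sketch shows how oversampling can distort an unweighted estimate. The numbers below are invented for illustration and are not NHANES values: six respondents come from an oversampled group and each represents 1,000 people, while four come from the rest of the population and each represents 6,000 people.

```python
# Hypothetical toy sample of (outcome, weight) pairs; invented numbers, not NHANES data.
# Oversampled group: 6 respondents, weight 1,000 each, 4 of them with the outcome.
# Rest of population: 4 respondents, weight 6,000 each, 1 of them with the outcome.
sample = (
    [(1, 1000)] * 4 + [(0, 1000)] * 2
    + [(1, 6000)] * 1 + [(0, 6000)] * 3
)

# Unweighted prevalence: every respondent counts once.
unweighted = sum(y for y, _ in sample) / len(sample)

# Weighted prevalence: every respondent counts as many times as their weight.
weighted = sum(y * w for y, w in sample) / sum(w for _, w in sample)

print(f"unweighted prevalence: {unweighted:.3f}")
print(f"weighted prevalence:   {weighted:.3f}")
```

In this toy case the unweighted prevalence is one half, while the weighted prevalence is one third, because the oversampled group has a higher rate of the outcome and dominates the raw counts.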
Dealing With Complex Samples
The NHANES sample design is very complex. Failing to prepare the data properly and to account for the sampling design will lead to incorrect estimates of standard errors and increase false-positive findings.
Dey et al. (2025) explain that when dealing with complex sample surveys like NHANES, you should consider three things: sampling weights, stratification, and cluster sampling.
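As a rough illustration of the first of those three elements, the sketch below fits a one-predictor logistic regression with and without sampling weights, using Newton's method on the weighted log-likelihood. Everything in it is an assumption for illustration: the function `fit_weighted_logit` is a hypothetical helper, and the rows are invented rather than taken from NHANES. It only shows how weights move the point estimates. It does not handle stratification or clustering, so it says nothing about correct standard errors, which require survey-aware software.

```python
import math

def fit_weighted_logit(x, y, w, iterations=25):
    """Fit y ~ b0 + b1 * x by Newton's method on the weighted log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)        # weighted gradient contribution
            v = wi * p * (1.0 - p)   # weighted curvature contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        # Solve the 2x2 Newton system for the update step.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical rows of (exposure, outcome, sampling weight); invented numbers.
rows = [
    (0, 0, 6000), (0, 0, 6000), (0, 1, 6000),
    (0, 0, 1000), (0, 1, 1000), (0, 1, 1000),
    (1, 1, 1000), (1, 1, 1000), (1, 1, 1000),
    (1, 0, 1000), (1, 1, 6000), (1, 0, 6000),
]
x, y, w = zip(*rows)

weighted = fit_weighted_logit(x, y, w)
unweighted = fit_weighted_logit(x, y, [1] * len(rows))  # every row counts once

print("weighted odds ratio:  ", round(math.exp(weighted[1]), 3))
print("unweighted odds ratio:", round(math.exp(unweighted[1]), 3))
```

Because the exposure is binary, each fitted slope is simply the log odds ratio between the two exposure groups, which makes the result easy to check by hand from the weighted and unweighted counts.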
Likewise, Birrell et al. (2023) give a practical example of the problem. Using NHANES, the mean age in an unweighted analysis is 51.15, whereas the properly weighted mean drops to 46.91.
That is a considerable difference that would alter the entire analysis.
Conclusion
While I learned this very early in my career, I was shocked to learn how many studies fail to use the proper analysis.
Dey et al. (2025) report that 41.7% of the studies involving data analysis didn’t use the correct methods. To me, the most critical finding in their paper was that only 2.7% applied imputation techniques.
That is a little disturbing when these studies are used to make decisions.