When I was growing up, the first presidential election after the military dictatorship in Brazil caught my attention.

I watched a very young governor take the lead in the general election. At the time, he was winning by 30%. Once the votes were counted, the actual outcome was very close to what the polls had shown.

Before the news began focusing on the runoff polls, I asked my parents if anyone had reached out to ask who they were voting for. Of course, no one had.

It was not until I got into statistics that I learned what really happens.

Pollsters don’t need to ask the entire population to estimate, statistically, how people are going to vote.

All that is needed is about 1,000 people.

Random Sampling

For this to work out, it cannot be just any 1,000 people.

Allow me to use an example that most people would be familiar with. Let’s say that David Wallace is conducting a climate survey, an assessment of the perception of culture, safety, and inclusion, at Dunder Mifflin, the paper company from The Office. While the exact number of employees is not disclosed during the show, let’s suppose the company has around 500 people.

Because the population is fairly small, we can use a simple 10% rule of thumb. That means we only need a sample of 50 people. During the series, I counted 13 branches that were either active, closed, or merged, plus the corporate office in New York. For the purposes of this exercise, let’s say there are only 9 branches and the corporate office.

If Wallace were to take 50 people from Scranton, Utica, and Corporate, it would be both correct and fundamentally wrong. Correct because he has 50 people. Wrong because they don’t represent all 10 locations. Quoting a margin of error and a 95% confidence level would not fix that.

The correct way would be to take 5 people from each of the 10 locations, selecting those 5 at random.

Random sampling means that everyone has the same chance of being selected. If Wallace drew 50 names out of all 500 employees, Toby and Kelly would each have a 50/500, or 1 in 10, chance of being picked. Drawing 5 per location keeps the chance equal within each location: 5 out of however many people work there.
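To make the selection step concrete, here is a minimal Python sketch of drawing 5 people at random from each location. The function names, the employee IDs, and the head counts are placeholders I made up for illustration, and only three locations are shown; this is not how any real survey tool does it.

```python
import random

def build_roster(sizes):
    """Build a hypothetical roster: location -> list of made-up employee IDs."""
    return {loc: [f"{loc}-{i}" for i in range(1, n + 1)] for loc, n in sizes.items()}

def equal_allocation_sample(roster, per_location, seed=None):
    """Draw the same number of people, without replacement, from every location."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return {loc: rng.sample(people, per_location) for loc, people in roster.items()}

# Placeholder head counts, for illustration only.
sizes = {"Scranton": 18, "Utica": 25, "Corporate": 32}
roster = build_roster(sizes)
sample = equal_allocation_sample(roster, per_location=5, seed=42)

for loc, picked in sample.items():
    print(loc, picked)
```

Because `random.sample` draws without replacement, nobody can be picked twice, and everyone inside a location has the same chance of ending up in the sample.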

Weighting

A complete sample is not enough on its own. Let’s say that Scranton has 18 employees. Utica has 25. Corporate has 32. You only have 5 people to represent the 32 at Corporate and another 5 for the 18 in Scranton. Treating those answers as equal wouldn’t be a fair assessment.

Hence the importance of weighting the data before applying the model. In a simple version, the weight would look like this:

w = (Location Size)/(Sample Size from Location)

So:

Location     Employees   Sample   Weight
Scranton     18          5        18/5 = 3.6
Utica        25          5        25/5 = 5.0
Corporate    32          5        32/5 = 6.4

And so on...

What do all these numbers mean? Each Scranton respondent represents 3.6 employees. Each Corporate respondent represents 6.4. You catch my drift.
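Here is a small Python sketch of how those weights change a result. The head counts and sample sizes are the ones above; the survey answers (1 = agrees, 0 = disagrees) are invented purely for illustration, and the function names are my own.

```python
def design_weights(location_sizes, sample_sizes):
    """w = (location size) / (sample size from location)."""
    return {loc: location_sizes[loc] / sample_sizes[loc] for loc in location_sizes}

def weighted_share(responses, weights):
    """Weighted share of 1-answers across all locations."""
    weighted_yes = sum(weights[loc] * sum(ans) for loc, ans in responses.items())
    weight_total = sum(weights[loc] * len(ans) for loc, ans in responses.items())
    return weighted_yes / weight_total

location_sizes = {"Scranton": 18, "Utica": 25, "Corporate": 32}
sample_sizes = {"Scranton": 5, "Utica": 5, "Corporate": 5}
weights = design_weights(location_sizes, sample_sizes)

# Invented answers: 1 = agrees with the survey statement, 0 = disagrees.
responses = {
    "Scranton": [1, 1, 1, 1, 0],
    "Utica": [1, 1, 0, 0, 0],
    "Corporate": [1, 0, 0, 0, 0],
}

all_answers = [a for ans in responses.values() for a in ans]
unweighted = sum(all_answers) / len(all_answers)
weighted = weighted_share(responses, weights)

print(f"unweighted: {unweighted:.3f}")
print(f"weighted:   {weighted:.3f}")
```

With these made-up answers, the unweighted share is about 47%, while the weighted share drops to about 41%, because the less enthusiastic Corporate respondents each stand in for more employees.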

If we are doing 10%, why not take 10% of each branch instead of doing all of these calculations?

I’m glad you asked.

Equal Allocation vs Proportional Sampling

Proportional sampling wouldn’t work well here. If each branch had exactly 50 employees, taking 5 from each would be a no-brainer. But branches aren’t equal in size, so the numbers don’t work out cleanly. Scranton has 18 employees, and you can’t take answers from 1.8 people.

That is only one of the reasons, and not the main one.

If a branch has 8 employees, 10% would be less than 1 person. In that case, the branch would disappear from the analysis completely, because nobody from it would be sampled. Hence the equal allocation: five from each branch guarantees that every location has a voice. Weighting then corrects for the size differences mathematically.
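The rounding problem is easy to see in a few lines of Python. This is only a sketch: the function name is mine, the eight-person branch is the hypothetical one from the paragraph above, and I am assuming fractions of a person are rounded down.

```python
def proportional_allocation(location_sizes, percent):
    """Whole people to draw per location at a fixed percentage, rounded down."""
    return {loc: n * percent // 100 for loc, n in location_sizes.items()}

sizes = {
    "Scranton": 18,
    "Utica": 25,
    "Corporate": 32,
    "Eight-person branch": 8,  # hypothetical small branch
}

allocation = proportional_allocation(sizes, percent=10)

for loc, n in sizes.items():
    exact = n * 10 / 100  # the fractional "10% of the branch" target
    print(f"{loc}: 10% is {exact:.1f} people, so we draw {allocation[loc]}")
```

Under that assumption, the eight-person branch gets zero respondents, which is exactly the situation equal allocation avoids.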

Why 1,000 Is Usually Enough

When explaining this, everyone likes to use soup as an example. Imagine a chicken noodle soup. If you eat a bowl of that soup, you will know what the entire pot tastes like. That analogy simplifies the process and helps people understand it.

But for 1,000 people to work properly, the sampling has to be done carefully and methodically. Let’s take the Dunder Mifflin example and apply it to an election poll.

Following the same equal-allocation logic, a pollster would survey 20 people in each of the 50 states before weighting them. If all 20 people surveyed in New York lived in the city, the state would skew drastically blue. In reality, most counties in the state are red.

Statistically speaking, 1,000 people will only really represent millions when the sample is truly random.
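For the curious, the usual textbook margin-of-error formula shows why the size of the sample matters far more than the size of the population. The Python sketch below assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst case of a 50/50 split; the function name and its parameters are my own. Weighted or stratified samples generally have a somewhat larger margin than this simple formula suggests.

```python
import math

def margin_of_error(n, p=0.5, z=1.96, population=None):
    """Approximate margin of error for a proportion from a simple random sample.

    If a population size is given, the finite population correction is applied.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error

# 1,000 respondents, population treated as effectively infinite.
print(f"n = 1,000: {margin_of_error(1000):.3f}")

# The same 1,000 respondents drawn from 1 million or from 300 million people.
print(f"n = 1,000 of 1 million:   {margin_of_error(1000, population=1_000_000):.3f}")
print(f"n = 1,000 of 300 million: {margin_of_error(1000, population=300_000_000):.3f}")

# The Dunder Mifflin survey: 50 respondents out of 500 employees.
print(f"n = 50 of 500: {margin_of_error(50, population=500):.3f}")
```

With 1,000 respondents the margin comes out at roughly 3 percentage points whether the population is one million or three hundred million. The 50-person Dunder Mifflin sample, on the other hand, lands at roughly 13 points.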

Conclusion

In short, 1,000 people can, in many cases, represent millions. There is one important condition: the methodology behind it all must be sensible. The inferential analysis piece is critical. Unless the sample is properly distributed and weighted, the results will be misleading.

The Pew Research Center is a perfect example of how this can work. For years, they have used a small group of people to generate solid information where 1,000 people have represented the USA in its entirety.