Saturday, February 7, 2026

Sampling and Data


I. Data Classification

Data may come from a population or from a sample. Lowercase letters such as x or y are generally used to represent data values. Most data can be put into one of the following categories:
  • Qualitative (Categorical)
  • Quantitative (Numerical)
Qualitative data are the result of categorizing or describing attributes (qualities or characteristics) of a population. They classify individuals into groups or categories, which is why they are often called categorical data. Hair color, blood type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. Qualitative data are generally described by words or letters. For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Researchers often prefer quantitative data over qualitative data because quantitative data lend themselves more easily to mathematical analysis. For example, it does not make sense to find an average hair color or blood type.

Example: 
A researcher records the blood type (A, B, AB, or O) of 50 patients. What kind of data is this?

Answer: Qualitative
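
As a small illustration of why categories are counted rather than averaged, here is a minimal Python sketch. The list of blood types is made up for this example and is not real patient data; the variable names are likewise only illustrative.

```python
from collections import Counter

# Hypothetical blood-type records (invented values, not real patients).
blood_types = ["A", "O", "B", "O", "AB", "A", "O", "A", "O", "B"]

# A frequency count is a meaningful summary of qualitative data.
frequencies = Counter(blood_types)

# The most common category (the mode) is defined; an "average blood type" is not.
most_common_type, most_common_count = frequencies.most_common(1)[0]

print(dict(frequencies))
print(most_common_type, most_common_count)
```

Trying to add the labels together or divide them would have no meaning, which is the point made above about average hair color or blood type.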


Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data. Quantitative data may be either discrete or continuous.

  • Quantitative Discrete Data: Data that are the result of counting. These data take on only certain numerical values (usually whole numbers). If you count the number of phone calls you receive for each day of the week, you might get values such as zero, one, two, or three.
  • Quantitative Continuous Data: Data that are the result of measuring, and hence do not have to be whole numbers. If you and your friends carry backpacks with books in them to school, the numbers of books in the backpacks are discrete data and the weights of the backpacks are continuous data.
Example:
A fitness app tracks the number of steps a user takes each day. What kind of data is this?

Answer: Quantitative Discrete (counting steps)

Example:
A fitness app tracks the distance a user walks each day. What kind of data is this?

Answer: Quantitative Continuous (measuring distance)
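
The two fitness-app examples can be sketched in Python as follows. The readings are invented for illustration, and using kilometers for distance is an assumption of this sketch. Counted values are stored as whole numbers and measured values as decimals, and both support arithmetic such as a mean.

```python
from statistics import mean

# Hypothetical week of readings (made-up values).
daily_steps = [8200, 10450, 7600, 12010, 9300, 4100, 6800]  # counted: discrete
daily_distance_km = [6.1, 7.9, 5.6, 9.2, 7.0, 3.1, 5.2]     # measured: continuous

# Unlike qualitative data, quantitative data can be averaged.
mean_steps = mean(daily_steps)
mean_distance_km = mean(daily_distance_km)

print(round(mean_steps, 1), round(mean_distance_km, 2))
```

Note that the mean of discrete data does not itself have to be a whole number, even though each individual count is.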



II. Data Visualization

Omitting Categories, Missing Data and Chart Selection

The table displays Ethnicity of Students but is missing the “Other/Unknown” category. This category contains people who did not feel they fit into any of the ethnicity categories or declined to respond. Notice that the frequencies do not add up to the total number of students. In this situation, create a bar graph and not a pie chart, because a pie chart implies that its slices account for the whole group.

The following graph is the same as the previous graph, but the “Other/Unknown” percentage has been included. The “Other/Unknown” category is large compared to some of the other categories (such as Native American and Pacific Islander). This is important to know when we think about what the data are telling us.

This particular bar graph in Figure 2 can be difficult to understand visually. The graph in Figure 3 is a Pareto chart. The Pareto chart has the bars sorted from largest to smallest and is easier to read and interpret.
*Note: This lesson's mention of a Pareto chart is superficial and incomplete. A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line.
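
The two ingredients of a Pareto chart described in the note (bars in descending order and a cumulative line) can be computed with a few lines of Python. This is only a sketch: the category names and counts below are invented and are not the values from the lesson's table.

```python
from itertools import accumulate

# Hypothetical category counts, invented for illustration only.
counts = {"Category B": 25, "Other/Unknown": 15, "Category A": 40, "Category C": 20}

# Bars: sort the categories from largest to smallest frequency.
bars = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Line: running total, expressed as a percent of the grand total.
total = sum(counts.values())
running_totals = accumulate(freq for _, freq in bars)
cumulative_percent = [100 * running / total for running in running_totals]

for (label, freq), cum in zip(bars, cumulative_percent):
    print(f"{label:15s} {freq:3d} {cum:6.1f}%")
```

A plotting library could draw `bars` as the bar heights and `cumulative_percent` as the line; the sketch stops at the numbers so that it has no dependencies.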

III. Sampling Methods

Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it represents. Most statisticians use various methods of random sampling in an attempt to achieve this goal, and this section describes a few of the most common ones. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample. Each method has pros and cons.

A. Random Sampling (Scientific/Unbiased)

1. Simple Random Sample (SRS): A simple random sample (SRS) is a subset of a larger population chosen so that every member of the population has an equal chance of being selected and every possible group of the same size is equally likely to be chosen.

Example:
Suppose Lisa is in a pre-calculus class with 31 other students. She wants to form a 4-person study group consisting of herself and 3 other people from the class. To make it fair, she decides to use a simple random sample to choose her partners. She writes the names of the other 31 students on slips of paper, puts them in a drum, and mixes them up thoroughly. She then reaches in and pulls out 3 names. Alternatively, she could have assigned each name a number and then used a random number generator to pick 3 of the numbers.
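
The random-number alternative for Lisa's study group might look like the following Python sketch. The placeholder student labels and the seed value are assumptions made only so the example is repeatable.

```python
import random

# Hypothetical roster: placeholder labels for the 31 other students.
classmates = [f"Student {i}" for i in range(1, 32)]

# Fixed seed (an arbitrary choice) so the sketch gives the same result each run.
rng = random.Random(2026)

# random.sample draws without repeats; every 3-person group is equally likely.
partners = rng.sample(classmates, k=3)
study_group = ["Lisa"] + partners

print(study_group)
```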

2. Stratified Sample: The population is divided into subgroups called strata based on a specific characteristic (like age, gender, or income). You then take a random sample from every subgroup to ensure they are all represented.

Example:
A high school principal wants to know student opinions on a new dress code. To make sure every grade has a voice, they divide the students into four strata: Freshmen, Sophomores, Juniors, and Seniors. They then randomly select 25 students from each grade level.
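
A minimal Python sketch of the principal's stratified sample is shown below. The enrollment numbers, student IDs, and seed are hypothetical; only the four strata and the 25 students per grade come from the example.

```python
import random

rng = random.Random(7)  # arbitrary seed for repeatability

# Hypothetical enrollment: made-up student IDs grouped by grade level.
strata = {
    "Freshmen": [f"FR-{i}" for i in range(320)],
    "Sophomores": [f"SO-{i}" for i in range(300)],
    "Juniors": [f"JR-{i}" for i in range(280)],
    "Seniors": [f"SR-{i}" for i in range(260)],
}

# Take a separate simple random sample of 25 from every stratum.
stratified_sample = {
    grade: rng.sample(students, k=25) for grade, students in strata.items()
}

for grade, chosen in stratified_sample.items():
    print(grade, len(chosen))
```

Because each grade is sampled separately, no grade can be left out by chance, which is the purpose of stratifying.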

3. Cluster Sample: The population is divided into many subgroups called clusters (often based on geography). You randomly select a few entire clusters and collect data from everyone inside those selected groups.

Example:
A fast-food chain wants to survey its employees nationwide. Instead of visiting every store, they randomly select 10 specific locations (clusters) across the country. They then interview every single employee working at those 10 locations, while ignoring the employees at all other locations.
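
The fast-food example can be sketched in Python as follows. The number of stores (200), the staff sizes, and the seed are invented for illustration; only the choice of 10 whole locations comes from the example.

```python
import random

rng = random.Random(11)  # arbitrary seed for repeatability

# Hypothetical chain: 200 stores, each with a made-up staff of 8 to 20 people.
locations = {
    f"Store {s}": [f"Store {s} / Employee {e}" for e in range(rng.randint(8, 20))]
    for s in range(1, 201)
}

# Randomly choose 10 whole clusters (stores)...
chosen_locations = rng.sample(sorted(locations), k=10)

# ...and survey every employee inside the chosen clusters.
surveyed = [emp for store in chosen_locations for emp in locations[store]]

print(chosen_locations)
print(len(surveyed))
```

The contrast with the stratified sketch is that here the randomness picks the groups, not the individuals within each group.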

4. Systematic Sample: Select a random starting point and then take every nth member from a list.

Example:
A quality control manager at a lightbulb factory wants to test for defects. They decide to test every 50th lightbulb that comes off the assembly line. They pick a random number between 1 and 50 to start (let's say 12), and then test the 12th, 62nd, 112th, and 162nd bulbs.
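
The lightbulb example can be sketched in Python as follows. The function name `systematic_sample` and the production-run sizes (200 and 1,000 bulbs) are assumptions made for this illustration.

```python
import random

def systematic_sample(population_size, interval, rng):
    """Return 1-based positions: a random start, then every interval-th item."""
    start = rng.randint(1, interval)  # inclusive on both ends
    return list(range(start, population_size + 1, interval))

# Reproduce the worked example with the start fixed at 12 and 200 bulbs produced.
example_positions = list(range(12, 200 + 1, 50))
print(example_positions)  # [12, 62, 112, 162]

# With a random start over a hypothetical run of 1,000 bulbs.
positions = systematic_sample(1000, 50, random.Random(3))
print(positions[:4])
```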

B. Non-Random Sampling (Biased)

Convenience Sampling: Using results that are readily available (e.g., asking people in a mall). This often leads to bias, where certain outcomes are favored over others.

IV. The Mechanics of Sampling

A. With vs. Without Replacement

As mentioned earlier, in a simple random sample every member of the population has an equal chance of being selected. For that chance to stay exactly the same on every draw, the sampling must be done with replacement. Sampling with replacement is the process where an item is selected from a population and then "replaced" (put back) before the next selection. This means that the same item can be chosen more than once in the same sampling process.

Sampling without replacement refers to sampling where once a member of the population is selected, they are removed from the pool and cannot be picked again. This is the standard method for surveys and polls.
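
In Python's standard library the two approaches correspond to two different functions, as this short sketch shows. The ten-person population and the seed are made up for the example.

```python
import random

rng = random.Random(5)  # arbitrary seed for repeatability

# Hypothetical population of 10 people, labeled 1 through 10.
population = list(range(1, 11))

# With replacement: each draw is made from the full population, so repeats can occur.
with_replacement = rng.choices(population, k=5)

# Without replacement: a chosen member is removed from the pool, so no repeats.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```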

While these two methods are technically different, they become mathematically almost equivalent when the population is large and the sample size is relatively small.

Example: Small vs. Large Populations

To see why this is true, compare how the probability changes in a small group versus a massive one.

1. The Small Population (Significant Difference)

Imagine you are sampling 2 people from a small office of 10 employees.
  • With Replacement: The probability of picking any specific person is 1/10 (0.1000) for the first pick and remains 1/10 (0.1000) for the second pick.
  • Without Replacement: The probability for the first pick is 1/10 (0.1000), but for the second pick, it changes to 1/9 (0.1111).
  • The Result: There is a noticeable difference in the second pick of 0.0111. In this case, the method of sampling significantly impacts the math.

2. The Large Population (Negligible Difference)

Now imagine you are sampling 2 people from a city of 100,000 residents.

  • With Replacement: The probability of picking a specific resident is 1/100,000 (0.00001000) for both the first and second picks.
  • Without Replacement: The probability for the first pick is 1/100,000 (0.00001000). For the second pick, it becomes 1/99,999 (0.0000100001).
  • The Result: The difference is only 0.0000000001.

Because the difference in a large population is so microscopic, statisticians can safely use the simpler "with replacement" formulas (which assume independence) even when they are actually sampling "without replacement." As a rule of thumb, if the sample is no more than about 5% to 10% of the total population, the two methods are treated as practically equivalent.
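
The arithmetic in the two scenarios above can be checked with exact fractions. This is a minimal sketch; the helper name `second_pick_gap` is made up for illustration, and the "without replacement" value assumes the specific person was not already chosen on the first pick, as in the text.

```python
from fractions import Fraction

def second_pick_gap(population_size):
    """Difference in second-pick probability: without minus with replacement."""
    with_replacement = Fraction(1, population_size)
    without_replacement = Fraction(1, population_size - 1)
    return without_replacement - with_replacement

for size in (10, 100_000):
    gap = second_pick_gap(size)
    print(f"N = {size:>7,}: gap = {gap} (about {float(gap):.10f})")
```

For a population of N, the gap works out to 1/(N(N-1)), so it shrinks roughly with the square of the population size.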


B. Sampling Errors, Nonsampling Errors and Bias
  • Sampling Error: The natural difference between a sample and a population. This occurs because a sample is only a subset. Note: Larger samples generally reduce sampling error.
  • Nonsampling Error: Errors caused by factors unrelated to the sampling process, such as a defective measuring scale or poorly worded survey questions.
  • Sampling Bias: Occurs when some members of a population are less likely to be chosen than others, leading to incorrect conclusions.
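
The note that larger samples generally reduce sampling error can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the population of 10,000 simulated weights, the normal distribution used to generate them, the sample sizes, and the seed.

```python
import random

rng = random.Random(42)  # arbitrary seed for repeatability

# Hypothetical population: 10,000 simulated weights (made-up parameters).
population = [rng.gauss(70, 12) for _ in range(10_000)]
population_mean = sum(population) / len(population)

def average_sampling_error(sample_size, repetitions=200):
    """Average absolute gap between a sample mean and the population mean."""
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, k=sample_size)
        sample_mean = sum(sample) / sample_size
        errors.append(abs(sample_mean - population_mean))
    return sum(errors) / repetitions

small_sample_error = average_sampling_error(10)
large_sample_error = average_sampling_error(1000)

print(round(small_sample_error, 3), round(large_sample_error, 3))
```

The simulation only addresses sampling error. A nonsampling error, such as a scale that reads too high, or a biased selection method would not shrink as the sample grows.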







Note: 
add in nominal and ordinal under qualitative data

References
Gemini AI

https://courses.lumenlearning.com/introstats1/chapter/sampling-and-data/


