Wednesday, March 11, 2026

Measures of the Location of the Data

Quartiles, Percentiles, and Median

The Big Idea: Dividing Data

All three concepts are about locating positions within a dataset — they help you understand how values are distributed and where any particular value stands relative to the rest.

I. Median

The median is the middle value of an ordered dataset. It splits data into two equal halves.

How to find it:

  1. Sort your data from smallest to largest
  2. If n is odd → the median is the middle value
  3. If n is even → the median is the average of the two middle values

Example:

  • Dataset: 3, 7, 8, 12, 15 → Median = 8 (middle value)
  • Dataset: 3, 7, 8, 12 → Median = (7 + 8) / 2 = 7.5

 The median is also known as the 50th percentile, because 50% of values fall below it.
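The two rules above can be sketched as a short Python function (standard library only):

```python
def median(values):
    """Middle value if n is odd; mean of the two middle values if n is even."""
    ordered = sorted(values)        # step 1: sort smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # n odd -> the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even -> average the two middle values

print(median([3, 7, 8, 12, 15]))   # 8
print(median([3, 7, 8, 12]))       # 7.5
```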

 

II. Quartiles

Quartiles divide ordered data into four equal parts (quarters).

Quartile         | Symbol | Also Called    | What it means
First Quartile   | Q1     | Lower Quartile | 25% of data falls below this
Second Quartile  | Q2     | Median         | 50% of data falls below this
Third Quartile   | Q3     | Upper Quartile | 75% of data falls below this

Example — Dataset: 2, 4, 6, 8, 10, 12, 14, 16

  • Q2 (Median) = (8 + 10) / 2 = 9
  • Q1 = median of the lower half {2, 4, 6, 8} = (4 + 6) / 2 = 5
  • Q3 = median of the upper half {10, 12, 14, 16} = (12 + 14) / 2 = 13

Interquartile Range (IQR) = Q3 − Q1 = 13 − 5 = 8. The IQR measures the spread of the middle 50% of the data and is useful for detecting outliers.
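The median-of-halves method used above can be written out in Python. (Note: statistical software such as NumPy defaults to interpolation methods and may return slightly different quartile values.)

```python
def median(values):
    values = sorted(values)
    n, mid = len(values), len(values) // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 via the median-of-halves method; when n is odd,
    the median itself is excluded from both halves."""
    values = sorted(values)
    n, mid = len(values), len(values) // 2
    lower = values[:mid]                                  # lower half
    upper = values[mid + 1:] if n % 2 else values[mid:]   # upper half
    return median(lower), median(values), median(upper)

q1, q2, q3 = quartiles([2, 4, 6, 8, 10, 12, 14, 16])
print(q1, q2, q3, "IQR =", q3 - q1)   # 5.0 9.0 13.0 IQR = 8.0
```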


III. Percentiles

Percentiles divide data into 100 equal parts. The pth percentile is the value below which p% of the data falls.

Formula (finding the position):

L = (p / 100) × n

Where p = percentile and n = number of data points. If L is a whole number, average the Lth and (L+1)th values. If L is a decimal, round up.

Example — Find the 30th percentile of: 5, 10, 15, 20, 25, 30, 35, 40 (n = 8)

L = (30 / 100) × 8 = 2.4 → round up to position 3

The value at position 3 is 15, so the 30th percentile = 15.
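The position rule can be turned into a short Python function. (A sketch of the formula as stated here; statistical packages often interpolate instead and may give slightly different answers.)

```python
import math

def percentile(values, p):
    """pth percentile via the position formula L = (p/100) * n."""
    values = sorted(values)
    L = p / 100 * len(values)
    if L == int(L):                      # whole number: average the Lth and (L+1)th values
        i = int(L)
        return (values[i - 1] + values[i]) / 2
    return values[math.ceil(L) - 1]      # decimal: round up, take that position

data = [5, 10, 15, 20, 25, 30, 35, 40]
print(percentile(data, 30))   # L = 2.4 -> position 3 -> 15
print(percentile(data, 25))   # L = 2 exactly -> average of positions 2 and 3 -> 12.5
```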


How They All Connect

0%          25%         50%         75%        100%
|___________|___________|___________|___________|
Min         Q1        Median        Q3         Max
           (25th      (50th        (75th
         percentile) percentile)  percentile)
  • Median = Q2 = 50th percentile
  • Q1 = 25th percentile
  • Q3 = 75th percentile
  • Every quartile is a percentile, but not every percentile is a quartile

Claude AI

Friday, February 27, 2026

Frequency Polygons, and Time Series Graphs (Line Graphs)

Here we distinguish between two types of line graphs. While they both use points connected by line segments, they serve two completely different statistical purposes.


I. Frequency Polygons

A Frequency Polygon is used to show the distribution of a data set. It is essentially a "connected" version of a histogram. 

A. Constructing a Frequency Polygon

1. Sort Data (Same as histogram) Arrange your data in ascending order. This helps you quickly identify the minimum and maximum values to determine the spread of the data.

2. Define Bins (Same as histogram) Decide on the number of bins and calculate the bin width.

  • Bin Width = (Maximum Value - Minimum Value) / Number of Bins.

  • Note: Ensure your bin widths are consistent throughout the graph.

3. Create a Frequency Table with Midpoints List your intervals and tally the frequencies as you would for a histogram. However, for a frequency polygon, you must also calculate the midpoint for each interval.

  • Formula: Midpoint = (Lower Boundary + Upper Boundary) / 2.

  • Example: If a bin range is 100–110, the midpoint is 105.

4. Draw and Label the Axes 

  • Horizontal Axis (X-axis): Label this with the measured variable (e.g., "Time in Hours"). Instead of marking just the boundaries, mark the midpoints you calculated in Step 3.
  • Vertical Axis (Y-axis): Label this as "Frequency" or "Relative Frequency," ensuring the scale starts at zero.

5. Plot Points and Connect Instead of drawing vertical bars, you will create a line graph:

  • Plot the Points: For each bin, place a dot at the intersection of its midpoint on the X-axis and its frequency on the Y-axis.

  • Connect the Dots: Use a straight edge to connect the points in order from left to right.
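Steps 3–5 amount to pairing each bin midpoint with its frequency and connecting those points. A minimal sketch (the bin edges and counts here are hypothetical, for illustration only):

```python
edges = [100, 110, 120, 130]   # hypothetical bin boundaries
freqs = [4, 7, 2]              # hypothetical frequency for each bin

# Step 3: midpoint = (lower boundary + upper boundary) / 2
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]
print(midpoints)               # [105.0, 115.0, 125.0]

# Step 5: the polygon is the sequence of (midpoint, frequency) points,
# connected left to right (e.g., with matplotlib: plt.plot(midpoints, freqs))
points = list(zip(midpoints, freqs))
```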



Example:




One advantage of a frequency polygon is that it allows histogram-like data representation of two sets of data on the same graph. Two histograms on the same graph tend to shroud each other and make comparison more difficult, but two frequency polygons can be graphed together with much less interference.

The figure below provides an example. The data come from a task in which the goal is to move a computer cursor to a target on the screen as fast as possible. On 20 of the trials, the target was a small rectangle; on the other 20, the target was a large rectangle. Time to reach the target was recorded on each trial. The two distributions (one for each target) are plotted together. The figure shows that, although there is some overlap in times, it generally took longer to move the cursor to the small target than to the large one.


II. Time Series Graphs
Though time series graphs look similar to frequency polygons, they display different things. Frequency polygons display the distribution of a data set (how often values fall into class intervals), while time series graphs show how a specific variable changes over time. The horizontal axis is always time (days, months, years, etc.), and the vertical axis is the recorded value at each time point (such as temperature or sales).


A. Constructing a Time Series Graph

To construct a time series graph, we must look at both pieces of our paired data set using a standard Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in the order in which they occur.


1. Organize Data Arrange your data points into a table in chronological order.

2. Draw and Label the Axes 
  • Horizontal Axis (X-axis): This is the time variable.
  • Vertical Axis (Y-axis): This is the quantity being measured. 

3. Plot the data points
  • For each pair (time, value), find the position on the graph where that time on the x-axis meets that value on the y-axis.
  • Place a small point or dot at each of these positions.
4. Connect the points
  • Starting with the earliest time, connect consecutive points with straight line segments to show how the variable changes over time.
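Steps 1 and 3–4 can be sketched in Python. (The (year, value) pairs below are hypothetical placeholders; the actual life-expectancy table from these notes is not reproduced here.)

```python
# hypothetical (year, value) pairs, recorded out of order
series = [(1960, 69.7), (1920, 54.1), (1940, 62.9)]

series.sort()                     # step 1: arrange in chronological order
years = [t for t, _ in series]    # x-axis: the time variable
values = [v for _, v in series]   # y-axis: the measured quantity
print(years)                      # [1920, 1940, 1960]

# A plotting library (e.g., matplotlib) would then connect consecutive points:
# plt.plot(years, values, marker="o")
```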
Example
The following data are about life expectancy in the U.S. from 1920–2000.














https://www.socscistatistics.com/charts/frequencydistribution/calculator/
https://onlinestatbook.com/2/graphing_distributions/freq_poly.html

Wednesday, February 25, 2026

Histograms

I. Intro

A histogram is a specific type of data visualization that shows the distribution of a continuous variable. While it looks similar to a bar chart, it serves a very different purpose: instead of comparing discrete categories (like "Apples" vs. "Oranges"), it groups numerical data into ranges called bins. It is essentially a visual representation of a frequency table for continuous data.


The Horizontal Axis (x-axis): The Bins
This shows the variable you measured. It is what the data represents (for example, test scores, heights, ages) and is divided into contiguous groups called classes, intervals, or bins that cover the range of the data. These bins are graphically shown as adjacent bars. Usually, every bin has equal width (e.g., intervals of 10 units). The bars touch because there is no space between the end of one numerical range and the start of the next (note that the bars in a bar chart usually do not touch).

The Vertical Axis (y-axis): The Frequency
The y-axis represents the count (or frequency). It tells you how many data points from your set fall into each specific bin. This can represent Frequency (total count) or Relative Frequency (the percentage of the total). The shape of the graph remains identical regardless of which you choose.
  • Formula: Relative Frequency = Frequency/Total Number of Data Values 


II. Constructing a Histogram

1. Sort Data 
  • Sort your data in ascending order. This identifies the range and makes counting easier. 
2. Define Bins
  • Decide how many bins to use. Depending on the actual data distribution and the goals of the analysis, different bin widths may be appropriate. There are various useful guidelines and rules of thumb.
  • Once you know the number of bins, determine the bin width. To ensure your bars are consistent, calculate the width of each interval:
    • Bin Width = (Maximum Value − Minimum Value) / Number of Bins. Adjust as needed.
3. Create a Frequency Table
  • List your intervals and tally how many data points fall into each. Add relative frequency column if your histogram is displaying relative frequency. 

4. Draw and Label the Axes
  • Horizontal Axis (X-axis): Label this with what the data represents (e.g., "Height in Inches" or "Test Scores") and mark the bin boundaries.
  • Vertical Axis (Y-axis): Label this as "Frequency" or "Relative Frequency." Ensure the scale starts at zero to avoid distorting the data.
5. Draw the Bars
  • The Height: Matches the frequency of that bin. 
  • The Width: Spans the entire interval on the X-axis

Example:
A manufacturer of AAA batteries had their quality control department test the lifespan of their batteries. Forty-two batteries were randomly selected and tested, with the number of hours they lasted listed below.

The Raw Data Set: 
108 125 137 110 167 158 142
168 163 121 134 146 135 163
148 153 169 154 156 142 160
147 119 124 145 167 161 155
138 126 149 168 151 129 157
115 124 165 152 159 144 163

Step 1: Sort Data

First, we take the scattered numbers and arrange them in ascending order. This allows us to easily find the Minimum (108) and Maximum (169) values to determine the spread. 

Sorted
108, 110, 115, 119, 121, 124, 124, 125, 126, 129, 134, 135, 137, 138, 142, 142, 144, 145, 146, 147, 148, 149, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 163, 163, 163, 165, 167, 167, 168, 168, 169


Step 2: Define Bins

We need to group these into intervals (bins). Let's aim for 7 bins.

Range = Largest value - Smallest value = 169 - 108 = 61

Bin Width: 61 / 7 = 8.7

To keep it simple for the graph, we’ll round the bin width to 10 and start our first bin at 100.

Step 3: Create a Frequency Table

Now we tally the sorted data into our defined intervals.
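The tally can be reconstructed directly from the raw data set above with a few lines of Python:

```python
data = [108, 125, 137, 110, 167, 158, 142, 168, 163, 121, 134, 146, 135, 163,
        148, 153, 169, 154, 156, 142, 160, 147, 119, 124, 145, 167, 161, 155,
        138, 126, 149, 168, 151, 129, 157, 115, 124, 165, 152, 159, 144, 163]

# bins of width 10 starting at 100: 100-109, 110-119, ..., 160-169
freq = {lo: sum(lo <= x < lo + 10 for x in data) for lo in range(100, 170, 10)}
for lo, f in freq.items():
    print(f"{lo}-{lo + 9}: {f}")
print("total:", sum(freq.values()))   # 42 batteries accounted for
```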





Step 4: Draw and Label the Axes

  • Horizontal Axis (x axis): Label as "Time (Hours)" and mark the bin boundaries (100 to 170).
  • Vertical Axis (y axis): Label as "Number of Batteries" and add the range (0 to 12).

Step 5: Draw the Bars

Each bar is drawn to its specific frequency. Notice that the bars touch each other, indicating a continuous scale of data. The highest frequency occurs in the 160–169 bin with 11 values.











Gemini AI

Tuesday, February 24, 2026

Stem-and-Leaf Graphs (Stemplots)

A stem-and-leaf plot (or stem-and-leaf display) is a unique way to organize numerical data so that you can see the overall shape of the distribution while still keeping every individual data point visible. It is essentially a hybrid between a table and a histogram.

To construct a stem-and-leaf display, the observations must first be sorted in ascending order.
Here is the sorted set of data values that will be used in the following example:

44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106

Next, it must be determined what the stems will represent and what the leaves will represent. Typically, the leaf contains the last digit of the number and the stem contains all of the other digits. In the case of very large numbers, the data values may be rounded to a particular place value (such as the hundreds place) that will be used for the leaves. The remaining digits to the left of the rounded place value are used as the stem. In this example, the leaf represents the ones place and the stem will represent the rest of the number (tens place and higher).

The stem-and-leaf display is drawn with two columns separated by a vertical line. The stems are listed to the left of the vertical line. It is important that each stem is listed only once and that no numbers are skipped, even if it means that some stems have no leaves. The leaves are listed in increasing order in a row to the right of each stem.
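The construction described above can be sketched in Python for the example data (note that stems 5 and 9 appear with no leaves):

```python
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

# stem = all digits except the last; leaf = the ones digit
stems = range(min(data) // 10, max(data) // 10 + 1)
plot = {s: [] for s in stems}      # list every stem, even ones with no leaves
for x in sorted(data):             # data must be in ascending order
    plot[x // 10].append(x % 10)

for s in stems:
    print(s, "|", " ".join(str(leaf) for leaf in plot[s]))
# 4 | 4 6 7 9
# 5 |
# 6 | 3 4 6 8 8
# ...
# 10 | 6
```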



The stemplot is a quick way to graph data and gives an exact picture of the data. You want to look for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50  instead of 500) while others may indicate that something unusual is happening.














https://www.mathsisfun.com/data/stem-leaf-plots.html
Gemini AI

Monday, February 16, 2026

Intro to Descriptive Statistics

I. Overview
Descriptive statistics involves summarizing and organizing data from a sample to reveal its key characteristics. It doesn't try to make predictions or reach conclusions about a larger population (that's inferential statistics); it simply describes exactly what you have in front of you.

Main Types
1. Measures of Central Tendency
This tells you where the "middle" of the data sits.
  • Mean: The mathematical average.
  • Median: The middle value when the data is lined up in order.
  • Mode: The most frequently occurring value.
2. Measures of Variability (Spread)
This tells you how "stretched out" or "clustered" your data is.
  • Range: The distance between the highest and lowest values.
  • Standard Deviation: How much the data points typically deviate from the mean.
  • Variance: The squared version of standard deviation, representing the degree of spread.
3. Frequency Distribution
This is often shown as a table or a graph (like a histogram) that shows how often each individual value occurs. It helps you see the "shape" of your data—whether it's a perfect bell curve or skewed to one side.

II. Graphs
Descriptive statistics often uses graphs as a tool that helps you learn about the shape or distribution of a sample or a population. A graph can be a more effective way of presenting data than a mass of numbers because we can see where data clusters and where there are only a few data values.

Some of the types of graphs that are used to summarize and organize data are:
  • the dot plot
  • the bar graph
  • the histogram
  • the stem-and-leaf plot
  • the frequency polygon (a type of broken line graph)
  • the pie chart
  • the box plot.









Gemini AI

Friday, February 13, 2026

Experimental Design and Ethics

Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more effective at growing roses than another? Is fatigue as dangerous to a driver as the influence of alcohol? Questions like these are answered using randomized experiments. In this module, you will learn important aspects of experimental design. Proper study design ensures the production of reliable, accurate data.

I. Observational Study vs. Experiment
Two types of studies that are commonly used in statistics are observational and experimental studies. There are distinct differences between the two types.

1. Observational Studies
In an observational study, the sample population being studied is measured, or surveyed, as it is. The researcher observes the subjects and measures variables, but does not influence the population in any way or attempt to intervene in the study. There is no manipulation by the researcher. Instead, data is simply gathered and correlations are investigated. Since observational studies do not manipulate any variable, the results can only allow the researcher to claim association, not causation (not a cause-and-effect conclusion).

2. Controlled Experiment
Unlike an observational study, an experimental study has the researcher purposely attempting to manipulate the variables. The goal is to determine what effect a particular treatment has on the outcome. Researchers take measurements or surveys of the sample population. The researchers then manipulate the sample population in some manner. After the manipulation, the researchers re-measure, or re-survey, using the same procedures to determine if the manipulation possibly changed the measurements. Since variables are controlled in a designed experiment, the results allow the researcher to claim causation (a cause-and-effect conclusion).

II. The Goal of an Experiment

The primary purpose of an experiment is to investigate the relationship between two variables. Specifically, researchers want to see if changing one variable causes a change in another. 

When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable. In a randomized experiment, the researcher manipulates values of the explanatory variable and measures the resulting changes in the response variable. The different values of the explanatory variable may be called treatments. An experimental unit is a single object or individual to be measured. 

The main principles we want to follow in experimental design are:
  • Randomization
  • Replication
  • Control

A. Randomization
In order to provide evidence that the explanatory variable is indeed causing the changes in the response variable, it is necessary to isolate the explanatory variable. The researcher must design their experiment in such a way that there is only one difference between groups being compared: the planned treatments. This is accomplished by randomization of experimental units to treatment groups. When subjects are assigned treatments randomly, all of the potential lurking variables are spread equally among the groups. At this point the only difference between groups is the one imposed by the researcher. Different outcomes measured in the response variable, therefore, must be a direct result of the different treatments. In this way, an experiment can show an apparent cause-and-effect connection between the explanatory and response variables.


B. Replication
The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response. In a single study, we replicate by collecting a sufficiently large sample. Additionally, a group of scientists may replicate an entire study to verify an earlier finding. Having individuals experience a treatment more than once, called repeated measures, is often helpful as well.

C. Control
The power of suggestion can have an important influence on the outcome of an experiment. Studies have shown that the expectation of the study participant can be as important as the actual medication. In one study of performance-enhancing drugs, researchers noted:

Results showed that believing one had taken the substance resulted in [performance] times almost as fast as those associated with consuming the drug itself. In contrast, taking the drug without knowledge yielded no significant performance increment. 

It is often difficult to isolate the effects of the explanatory variable. To counter the power of suggestion, researchers set aside one treatment group as a control group. This group is given a placebo treatment–a treatment that cannot influence the response variable. The control group helps researchers balance the effects of being in an experiment with the effects of the active treatments. Of course, if you are participating in a study and you know that you are receiving a pill which contains no actual medication, then the power of suggestion is no longer a factor. Blinding in a randomized experiment preserves the power of suggestion. When a person involved in a research study is blinded, he does not know who is receiving the active treatment(s) and who is receiving the placebo treatment. A double-blind experiment is one in which both the subjects and the researchers involved with the subjects are blinded.

Control Group - A group in a randomized experiment that receives no (or an inactive) treatment but is otherwise managed exactly as the other groups

Placebo - An inactive treatment that has no real effect on the explanatory variable (ex. sugar pill or saline injection). 

Blinding - A procedure where the participants in a study are kept unaware of whether they are in the treatment group or the control group.

Double Blinding - The act of blinding both the subjects of an experiment and the researchers who work with the subjects


Example 1
Researchers want to investigate whether taking aspirin regularly reduces the risk of heart attack. Four hundred men between the ages of 50 and 84 are recruited as participants. The men are divided randomly into two groups: one group will take aspirin, and the other group will take a placebo. Each man takes one pill each day for three years, but he does not know whether he is taking aspirin or the placebo. At the end of the study, researchers count the number of men in each group who have had heart attacks.

Identify the following values for this study: population, sample, experimental units, explanatory variable, response variable, treatments.


Solution:
The population is men aged 50 to 84.
The sample is the 400 men who participated.
The experimental units are the individual men in the study.
The explanatory variable is oral medication.
The treatments are aspirin and a placebo.
The response variable is whether a subject had a heart attack.



III. Ethics
The widespread misuse and misrepresentation of statistical information often gives the field a bad name. Some say that “numbers don’t lie,” but the people who use numbers to support their claims often do.



Professional organizations, like the American Statistical Association, clearly define expectations for researchers. There are even laws in the federal code about the use of research data.

When a statistical study uses human participants, as in medical studies, both ethics and the law dictate that researchers should be mindful of the safety of their research subjects. The U.S. Department of Health and Human Services oversees federal regulations of research studies with the aim of protecting participants. When a university or other research institution engages in research, it must ensure the safety of all human subjects. For this reason, research institutions establish oversight committees known as Institutional Review Boards (IRB). All planned studies must be approved in advance by the IRB. Key protections that are mandated by law include the following:
  • Risks to participants must be minimized and reasonable with respect to projected benefits.
  • Participants must give informed consent. This means that the risks of participation must be clearly explained to the subjects of the study. Subjects must consent in writing, and researchers are required to keep documentation of their consent.
  • Data collected from individuals must be guarded carefully to protect their privacy.












https://pressbooks.lib.vt.edu/introstatistics/chapter/experimental-design-and-ethics/

https://www.khanacademy.org/math/statistics-probability/designing-studies/types-studies-experimental-observational/a/observational-studies-and-experiments

https://mathbitsnotebook.com/Algebra2/Statistics/STSurveys.html

Thursday, February 12, 2026

Frequency, Frequency Tables, and Levels of Measurement


Part 1: Levels of Measurement (The "NOIR" Scale)

Before you can analyze data, you must understand what "type" of data you have. The level of measurement (scales of measurement) determines which mathematical operations (like addition or averaging) are allowed. We classify data into four levels, often remembered by the acronym NOIR. Going from lowest to highest, the 4 levels of measurement are cumulative. This means that they each take on the properties of lower levels and add new properties.


1. Nominal Scale Level (The "Naming" Level)

The word "Nominal" comes from the Latin nomen, meaning "name." At this level, numbers or words are used solely as identifiers or categories. There is no mathematical value to the labels themselves.

  • Key Characteristics: Data is qualitative and mutually exclusive (you belong to one category or another). You cannot say one category is "more" or "less" than another.
  • The "Number" Trap: Sometimes we use numbers for nominal data, like Zip Codes or Jersey Numbers. You can’t add two zip codes together to get a "better" location; the number is just a shortcut for a name.
  • Examples: Eye color, gender, political party, types of flooring, or "Yes/No" survey responses.
  • Mathematical Limit: You can only calculate the Mode (the most common category).

2. Ordinal Scale Level (The "Ordering" Level)

The "Ordinal" level introduces rank. It tells you the position of data points relative to each other, but it doesn't tell you how much better or bigger one is than the other.

  • Key Characteristics: There is a logical sequence or "natural order." However, the intervals between the ranks are unknown or inconsistent.
  • The "Gap" Problem: If you come in 1st, 2nd, and 3rd in a race, the ordinal scale tells us the order of finish. It does not tell us if the 1st place runner beat 2nd place by one second or ten minutes.
  • Examples: Likert scales (Strongly Disagree to Strongly Agree), class rank (Valedictorian, Salutatorian), or "Small, Medium, Large" drink sizes.
  • Mathematical Limit: You can find the Mode and the Median (the middle rank), but you cannot calculate a meaningful Mean (average).

3. Interval Scale Level (The "Equal Spacing" Level)

The "Interval" level gives us order and tells us that the distance between each point is exactly the same. However, it lacks a "true zero."

  • Key Characteristics: The difference between 70° and 80° is exactly the same as the difference between 30° and 40°. Because the intervals are equal, addition and subtraction become possible.
  • The "Zero" Problem: Zero on an interval scale is just another point on the line; it does not mean "nothing." For example, 0°C doesn't mean there is "no temperature"—it's just the freezing point of water. Because there is no starting point, you cannot make "twice as much" statements.
  • Examples: Temperature (Fahrenheit/Celsius), IQ scores, and Years (the year 0 is an arbitrary point in time, not the "beginning of time").
  • Mathematical Limit: You can calculate the Mean, Median, and Mode. You can add and subtract, but you cannot multiply or divide (no ratios).


4. Ratio Scale Level (The "True Zero" Level)

This is the "gold standard" of measurement. It has all the properties of the previous three, but adds a True Zero point. Zero actually means "the absence of the thing being measured."
  • Key Characteristics: Because there is an absolute zero, you can finally perform multiplication and division. You can meaningfully say one value is "double" or "half" of another.
  • The "Ratio" Advantage: If you weigh 200 lbs and your friend weighs 100 lbs, you are exactly twice as heavy. This is only possible because 0 lbs means "no weight."
  • Examples: Weight, height, distance, time duration (e.g., "it took 5 minutes"), and money ($0 means you have no money).
  • Mathematical Limit: All statistical operations are allowed. This is the level required for the most advanced types of data analysis.

Part 2: Organizing Data with Frequency Tables

Once you have collected data (especially Nominal or Ordinal data), you need to organize it to see patterns. We use Frequency Tables to do this.

Key Definitions

1. Frequency (f): The number of times a specific value occurs.

2. Relative Frequency: The proportion (or percent) of the total data that a value represents.  

    Formula: Frequency/Total Number of Data Points

3. Cumulative Relative Frequency: The running total of relative frequencies. It shows the percentage of data that falls at or below a certain value.

Example:
Suppose 20 students reported the hours they worked yesterday. Their responses were as follows:
5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3

The following table lists the different data values in ascending order and their frequencies.


A frequency is the number of times a value of the data occurs. According to the table, there are three students who work two hours, five students who work three hours, and so on. The sum of the values in the frequency column, 20, represents the total number of students included in the sample.

A relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes. To find the relative frequencies, divide each frequency by the total number of students in the sample–in this case, 20. Relative frequencies can be written as fractions, percents, or decimals.

Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumulative relative frequencies, add all the previous relative frequencies to the relative frequency for the current row, as shown in the table below.
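For the 20 responses above, all three columns can be computed with the standard library:

```python
from collections import Counter

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]
n = len(hours)

freq = Counter(hours)                     # frequency: how many times each value occurs
cumulative = 0.0
for value in sorted(freq):
    relative = freq[value] / n            # relative frequency = f / n
    cumulative += relative                # running total of relative frequencies
    print(value, freq[value], relative, round(cumulative, 2))
```

The last cumulative value is always 1.00 (100% of the data falls at or below the largest value).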

Part 3: Rounding

When calculating the frequency, you may need to round your answers so that they are as precise as possible. A simple way to round off answers is to carry your final answer one more decimal place than was present in the original data. Round off only the final answer. Do not round off any intermediate results, if possible. If it becomes necessary to round off intermediate results, carry them to at least twice as many decimal places as the final answer.

 For example, the average of the three quiz scores four, six, and nine is 6.3333.... Since the data are whole numbers, we would round this to 6.3.
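In Python, that rule looks like keeping full precision internally and rounding only the final result:

```python
scores = [4, 6, 9]
average = sum(scores) / len(scores)   # 6.333... (do not round intermediate results)
print(round(average, 1))              # data are whole numbers -> one extra decimal: 6.3
```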





Reference
Gemini AI




Saturday, February 7, 2026

Sampling and Data


I. Data Classification

Data may come from a population or from a sample. Small letters like x or y generally are used to represent data values. Most data can be put into the following categories:
  • Qualitative (Categorical)
  • Quantitative (Numerical)
Qualitative data are the result of categorizing or describing attributes (qualities or characteristics) of a population. It classifies individuals into groups or categories. Qualitative data are also often called categorical data. Hair color, blood type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. Qualitative data are generally described by words or letters. For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Researchers often prefer to use quantitative data over qualitative data because it lends itself more easily to mathematical analysis. For example, it does not make sense to find an average hair color or blood type.

Example: 
A researcher records the blood type (A, B, AB, or O) of 50 patients. What kind of data is this?

Answer: Qualitative


Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data. Quantitative data may be either discrete or continuous.

  • Quantitative Discrete Data: Data that are the result of counting. These data take on only certain numerical values (usually whole numbers). If you count the number of phone calls you receive for each day of the week, you might get values such as zero, one, two, or three.
  • Quantitative Continuous Data: Data that are the result of measuring (and hence need not be whole numbers). If you and your friends carry backpacks with books in them to school, the numbers of books in the backpacks are discrete data and the weights of the backpacks are continuous data.
Example:
A fitness app tracks the number of steps a user takes each day. What kind of data is this?

Answer: Quantitative Discrete (counting steps)

Example:
 A fitness app tracks the distance a user walks each day. What kind of data is this?

Answer: Quantitative Continuous (measuring distance)



II. Data Visualization

Omitting Categories, Missing Data and Chart Selection

The table displays Ethnicity of Students but is missing the “Other/Unknown” category. This category contains people who did not feel they fit into any of the ethnicity categories or declined to respond. Notice that the frequencies do not add up to the total number of students. In this situation, create a bar graph and not a pie chart.

The following graph is the same as the previous graph but the “Other/Unknown” percent (%) has been included. The “Other/Unknown” category is large compared to some of the other categories (Native American, Pacific Islander). This is important to know when we think about what the data are telling us.

This particular bar graph in Figure 2 can be difficult to understand visually. The graph in Figure 3 is a Pareto chart. The Pareto chart has the bars sorted from largest to smallest and is easier to read and interpret.
*Note: This lesson's mention of a Pareto chart is superficial and incomplete. A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars, and the cumulative total is represented by the line.

III. Sampling Methods

Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing. Most statisticians use various methods of random sampling in an attempt to achieve this goal. This section will describe a few of the most common methods. There are several different methods of random sampling. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample. Each method has pros and cons.

A. Random Sampling (Scientific/Unbiased)

1. Simple Random Sample (SRS): A simple random sample (SRS) is a subset of a larger population where every single member of the population has an equal chance of being selected.

Example:
Suppose Lisa is in a pre-calculus class with 31 other students. She wants to form a 4-person study group consisting of herself and 3 other people from the class. To make it fair, she decides to use a Simple Random Sample to choose her partners. She writes the names on slips of paper, puts them in a drum, and mixes them up thoroughly. She then reaches in and pulls out 3 names. Alternatively, she could have assigned each name a number and used a computer's random number generator to select three of them.
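Lisa's drawing-names procedure can be sketched with Python's `random.sample`, which gives every classmate an equal chance (the roster names are placeholders):

```python
import random

# Hypothetical roster standing in for Lisa's 31 classmates.
classmates = [f"Student_{i}" for i in range(1, 32)]

random.seed(0)  # fixed seed only so the demonstration is reproducible
partners = random.sample(classmates, 3)  # each classmate equally likely
print(partners)
```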

2. Stratified Sample: The population is divided into subgroups called strata based on a specific characteristic (like age, gender, or income). You then take a random sample from every subgroup to ensure they are all represented.

Example:
A high school principal wants to know student opinions on a new dress code. To make sure every grade has a voice, they divide the students into four strata: Freshmen, Sophomores, Juniors, and Seniors. They then randomly select 25 students from each grade level.
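The principal's stratified sample can be sketched as follows (the enrollment numbers are invented for illustration):

```python
import random

# Hypothetical enrollment lists for each stratum (grade level).
strata = {
    "Freshmen":   [f"F{i}" for i in range(300)],
    "Sophomores": [f"So{i}" for i in range(280)],
    "Juniors":    [f"J{i}" for i in range(260)],
    "Seniors":    [f"Se{i}" for i in range(250)],
}

random.seed(1)
# Take a random sample of 25 students from EVERY stratum.
sample = {grade: random.sample(students, 25)
          for grade, students in strata.items()}

total = sum(len(chosen) for chosen in sample.values())
print(total)  # 4 strata x 25 students = 100, with every grade represented
```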

3. Cluster Sample: The population is divided into many subgroups called clusters (often based on geography). You randomly select a few entire clusters and collect data from everyone inside those selected groups.

Example:
A fast-food chain wants to survey its employees nationwide. Instead of visiting every store, they randomly select 10 specific locations (clusters) across the country. They then interview every single employee working at those 10 locations, while ignoring the employees at all other locations.
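A sketch of that cluster sample in Python (the store count and employee rosters are made up for illustration):

```python
import random

# Hypothetical chain: 50 locations, each with its own 12-person roster.
stores = {f"Store_{k}": [f"Store_{k}-Emp_{e}" for e in range(12)]
          for k in range(50)}

random.seed(2)
chosen_stores = random.sample(list(stores), 10)  # pick 10 whole clusters

# Survey EVERYONE inside the chosen clusters; ignore all other stores.
surveyed = [emp for store in chosen_stores for emp in stores[store]]
print(len(surveyed))  # 10 stores x 12 employees = 120
```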

4. Systematic Sample: Select a random starting point and take every nth member from a list.

Example:
A quality control manager at a lightbulb factory wants to test for defects. They decide to test every 50th lightbulb that comes off the assembly line. They pick a random number between 1 and 50 to start (let's say 12), and then test the 12th, 62nd, 112th, and 162nd bulbs.
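The manager's systematic sample can be sketched as (fixing the random start at 12 to match the example above):

```python
# Systematic sampling: random start, then every 50th item.
STEP = 50
N_BULBS = 200  # bulbs observed on the line (a made-up run length)

start = 12  # in practice: chosen at random between 1 and STEP
tested = list(range(start, N_BULBS + 1, STEP))
print(tested)  # [12, 62, 112, 162]
```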

B. Non-Random Sampling (Biased)

Convenience Sampling: Using results that are readily available (e.g., asking people in a mall). This often leads to Bias, where certain outcomes are favored over others.

IV. The Mechanics of Sampling

A. With vs. Without Replacement

As mentioned earlier, with a simple random sample, every single member of the population has an equal chance of being selected. Strictly speaking, that equal chance holds on every draw only when sampling is done with replacement. Sampling with replacement refers to the process where an item is selected from a population and, after being selected, is "replaced" back into the population before the next selection. This means that the same item can be chosen multiple times in the same sampling process.

Sampling without replacement refers to sampling where once a member of the population is selected, they are removed from the pool and cannot be picked again. This is the standard method for surveys and polls.

While these two methods are technically different, they become mathematically almost equivalent when the population is large and the sample size is relatively small.

Example: Small vs. Large Populations

To see why this is true, compare how the probability changes in a small group versus a massive one.

1. The Small Population (Significant Difference)

Imagine you are sampling 2 people from a small office of 10 employees.
  • With Replacement: The probability of picking any specific person is 1/10 (0.1000) for the first pick and remains 1/10 (0.1000) for the second pick.
  • Without Replacement: The probability for the first pick is 1/10 (0.1000), but for the second pick, it changes to 1/9 (0.1111).
  • The Result: There is a noticeable difference in the second pick of 0.0111. In this case, the method of sampling significantly impacts the math.

2. The Large Population (Negligible Difference)

Now imagine you are sampling 2 people from a city of 100,000 residents.

  • With Replacement: The probability of picking a specific resident is 1/100,000 (0.00001000) for both the first and second picks.
  • Without Replacement: The probability for the first pick is 1/100,000 (0.00001000). For the second pick, it becomes 1/99,999 (0.0000100001).
  • The Result: The difference is only 0.0000000001.
Because the difference in a large population is so microscopic, statisticians can safely use the simpler "with replacement" formulas (which assume independence) even when they are actually sampling "without replacement." As a general rule of thumb, if the sample size is less than about 5% of the total population (some texts allow up to 10%), the two methods are treated as mathematically identical.
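The two comparisons above can be checked numerically (`second_pick_gap` is a helper name I've made up):

```python
# Gap between second-pick probabilities with vs. without replacement.
def second_pick_gap(population_size):
    with_repl = 1 / population_size            # pool unchanged
    without_repl = 1 / (population_size - 1)   # pool shrinks by one
    return without_repl - with_repl

print(f"{second_pick_gap(10):.4f}")        # 0.0111 -- noticeable
print(f"{second_pick_gap(100_000):.10f}")  # 0.0000000001 -- negligible
```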


B. Sampling Errors, Nonsampling Errors and Bias
  • Sampling Error: The natural difference between a sample and a population. This occurs because a sample is only a subset. Note: Larger samples generally reduce sampling error.
  • Nonsampling Error: Errors caused by factors unrelated to the sampling process, such as a defective measuring scale or poorly worded survey questions.
  • Sampling Bias: Occurs when some members of a population are less likely to be chosen than others, leading to incorrect conclusions.








Reference
Gemini AI

https://courses.lumenlearning.com/introstats1/chapter/sampling-and-data/



Friday, February 6, 2026

Key Terms: Statistics & Probability

I. Fundamental Concepts
Part 1: The Groups (Who?)

In statistics, we usually want to learn about a massive group, but we can only afford to check a small part of it.

Population
Definition: The entire collection of people, things, or objects you want to study. It is the "whole picture."

Key Concept: In the real world, populations are often too large to check completely (e.g., "All voters in the USA" or "All items produced by a factory in 2024").

Sample
Definition: A smaller subset selected from the population.

Key Concept: The goal is to select a Representative Sample—a group that accurately reflects the characteristics of the full population. If your sample is good, the results will apply to the whole population.

Part 2: The Measurements (What?)
Once we have our sample, we need to gather information.

Variable
Definition: A characteristic of interest for each person or object in a population. Essentially it is the "question" you are studying. Variables are notated by capital letters such as X and Y.

Numerical Variable (Quantitative): Something you count or measure (e.g., Weight, Age, Amount of Money Spent). The litmus test is to ask if it makes meaningful sense to calculate an average. 
    -Average Age? Yes (Numerical)
    -Average Zip Code? No. Though a number, you can't have an "average location." (Categorical).

Categorical Variable (Qualitative): These variables place individuals into groups or categories. The answer is a label or a word (e.g., Political Party, Hair Color, Yes/No). The litmus test is to ask if the answer is a word or label rather than a number you could meaningfully average.


Data
Definition: The actual values (answers) you collect for the variable. A datum is a single value.

Key Distinction: The Variable is the concept (e.g., "Age"); the Data is the result (e.g., "18, 21, 19").

Part 3: The Numbers (The Results)
This is the most critical distinction in statistics. The name of the number changes depending on where the data came from.

Parameter
Definition: A number that describes the Population.

Key Concept: This is usually the "unknown truth" because we rarely have data for the entire population.

Statistic
Definition: A number that describes the Sample.

Key Concept: This is an estimate. We calculate the Statistic from our sample data to estimate the unknown Parameter.


Putting It Together: A Single Example
Let's apply all six terms to one scenario to see how they fit together.

The Study: We want to know the average amount of money first-year students at ABC College spend on school supplies.

Term              Applied to this Study
Population     All first-year students at ABC College.
Sample         100 specific students we surveyed.
Variable        The amount of money spent (excluding books).
Data             The specific dollar amounts listed: $150, $200, $225, etc.
Statistic        The average calculated from the 100 students (e.g., $191.67).
Parameter     The "true" average for all students (which remains unknown unless we ask everyone).










Reference
Gemini AI

https://courses.lumenlearning.com/introstats1/chapter/definitions-of-statistics-probability-and-key-terms/

Tuesday, February 3, 2026

Permutations (With Repetition)





Permutations with Repetition are arrangements of items where the same item can be selected more than once. Unlike standard permutations, where an item is "used up" after being picked, in this scenario, the item is conceptually "put back" into the pool and is available to be chosen again for the next position.

Because the pool of options never shrinks, the number of choices remains constant for every "slot" you need to fill.
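Since the pool never shrinks, r slots with n choices each give n x n x ... x n = n^r total arrangements. A quick Python sketch verifies that formula against brute-force enumeration (the 3-digit code scenario is my own illustration):

```python
import itertools

# Permutations with repetition: n choices for each of r slots -> n**r.
digits = "0123456789"   # n = 10 options
r = 3                   # a hypothetical 3-digit code

by_formula = len(digits) ** r
by_listing = sum(1 for _ in itertools.product(digits, repeat=r))
print(by_formula, by_listing)  # 1000 1000
```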























Friday, January 30, 2026

Permutations (Without Repetition)

In the prior lesson, we stated that a permutation is an arrangement of items where the specific sequence is important. In these types of permutations, we assume no repetition. This means once an item is placed in a position, it cannot be selected again. The formula is:

P(n, r) = n! / (n - r)!
Where:
  • P: Represents Permutations (the count of possible arrangements).
  • n: The total number of distinct items available to choose from.
  • r: The number of items being selected and arranged. 
Note that P(n,r) is also often written as nPr


To understand this formula, we will first look at its underlying concepts.

I. Fundamental Counting Principle
The fundamental principle of counting (aka, Rule of Products) is a basic concept used to determine the total number of possible outcomes in a situation where there are multiple independent events. It allows us to count a large number of possibilities without needing to list each one individually.

The Fundamental Counting Principle states that if there are a ways (choices) of doing something and b ways (choices) of doing another thing, then there are a x b ways (choices) of performing both actions.

Example: The Ice Cream Shop

You can choose a cone: Waffle or Sugar (2 options).

You can choose a flavor: Vanilla, Chocolate, or Strawberry (3 options).

How many distinct ice cream cones can you make?

2 x 3 = 6 possibilities
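The ice cream count can be checked with `itertools.product`, which enumerates every (cone, flavor) pair:

```python
import itertools

cones = ["Waffle", "Sugar"]                       # 2 options
flavors = ["Vanilla", "Chocolate", "Strawberry"]  # 3 options

combos = list(itertools.product(cones, flavors))  # every pairing
print(len(combos))  # 2 x 3 = 6
```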


II. From Counting to Factorials (n!)
Now let's apply that same counting principle to arranging objects.

Example: Arranging Books
Imagine you have 3 different books (A, B, and C) and you want to put them in order on a shelf. How many different ways can you arrange them?

  1. Slot 1: How many choices do you have for the first spot? 3 (A, B, or C).
  2. Slot 2: You put one book down. How many are left for the second spot? 2.
  3. Slot 3: You put the second book down. How many are left? 1.
According to the Fundamental Counting Principle, we multiply these choices:

3 x 2 x 1 = 6 arrangements

In mathematics, we arrange items so often that we created a shorthand symbol for "multiplying a number by every integer smaller than it down to 1." This is the Factorial (!).

5! = 5 x 4 x 3 x 2 x 1

3! = 3 x 2 x 1

1! = 1

Note the special case of the factorial of 0 = 1

0! = 1 

n! = n x (n-1) x (n-2) x ... x 1

Key Takeaway: Factorials (n!) calculate the number of ways to arrange ALL items in a set.
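The book-arranging count above can be verified in Python by comparing a brute-force listing against `math.factorial`:

```python
import itertools
import math

books = ["A", "B", "C"]
arrangements = list(itertools.permutations(books))  # every ordering of ALL items
print(len(arrangements), math.factorial(3))  # 6 6
```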

III. Deriving the Permutations Formula

This is where many students get lost. What if we don't want to arrange all the books? What if we have a big group, but we only want to arrange a few of them?

This is called a Permutation.

  • n = Total items available.

  • r = Number of items we are actually choosing and arranging.

The Logic (Without the Formula)
Let's use the Fundamental Counting Principle again.

Scenario: Eight runners are in a race (n=8). We need to award Gold, Silver, and Bronze medals (r=3). We are not arranging all 8 runners, just the top 3.

  1. Gold Medal: Any of the 8 runners can win.

  2. Silver Medal: The winner can't win twice, so 7 runners are left.

  3. Bronze Medal: The top two are occupied, so 6 runners are left.

Using the Counting Principle:

8 x 7 x 6 = 336 ways

Finding the Formula 
How do we represent that "partial factorial" mathematically?

We calculated: 8 x 7 x 6.

We know that the full factorial is: 8! = 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1

We want to mathematically "remove" the trailing part we didn't use (5 x 4 x 3 x 2 x 1). In multiplication, we "remove" things by dividing.

The part we want to remove (5 x 4 x 3 x 2 x 1) is actually just 5!.

So, we can write our calculation as:

8! / 5! = (8 x 7 x 6 x 5 x 4 x 3 x 2 x 1) / (5 x 4 x 3 x 2 x 1) = 8 x 7 x 6 = 336

Where did the 5 come from? The 5 is the number of people who lost (did not get a medal).

Losers = Total Runners - Winners 
5 = 8 - 3

So the formula is actually 8! / (8 - 3)! = 8! / 5!. This leads us directly to the formula for permutations (nPr):

P(n, r) = n! / (n - r)!

Note:
In practice, when solving these problems you will likely just use a shorthand method for what the formula formally represents. That is, you will:
1. Look at r to find how many numbers will be in your multiplication chain.
2. Then start with n as the first number in your chain.
3. Multiply by decreasing integers (n - 1, n - 2, etc)
4. Stop when you have reached r numbers.

I'll call this the countdown method.

IV. Solving Permutations
Example:
Solve for P(6, 3)

Step 1: Look at r (which here is 3). This means you need to multiply 3 numbers.
Step 2: Start with n (which for this problem is 6).
Step 3: Multiply downward until you have 3 numbers in your chain:

6 x 5 x 4 = 120

Now I'll solve using the formula:

P(6, 3) = (6 x 5 x 4 x 3 x 2 x 1) / (3 x 2 x 1)

P(6, 3) = 720/6 = 120
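The countdown method is easy to turn into a small helper (`n_p_r` is a name I've made up, a sketch rather than a library function):

```python
import math

def n_p_r(n, r):
    """Count permutations of r items drawn from n, without repetition,
    via the countdown method: multiply r decreasing factors from n."""
    result = 1
    for k in range(n, n - r, -1):  # n, n-1, ..., n-r+1  (r factors)
        result *= k
    return result

print(n_p_r(6, 3))  # 6 x 5 x 4 = 120
# Cross-check against the formula n! / (n - r)!:
print(math.factorial(6) // math.factorial(6 - 3))  # 120
```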








Reference
Gemini AI


https://www.geeksforgeeks.org/maths/fundamental-principle-of-counting/

https://www.mathsisfun.com/data/basic-counting-principle.html

https://en.wikipedia.org/wiki/Rule_of_product

Permutations and Combinations Compared

In probability and statistics, combinations and permutations are the two primary concepts used to count how many ways a set of items can be selected or arranged. The fundamental difference between them comes down to one thing: order. If the order doesn't matter, it is a combination. If the order does matter, it is a permutation.

I. Intro to Combinations and Permutations
A. Combinations (Order Does Not Matter)
A combination is a selection of items where the sequence is irrelevant. You are only concerned with which items are picked, not how they are arranged.

A simple example would be a fruit salad. A mix of apples, grapes, and bananas is the same salad as bananas, grapes, and apples. The group is the same regardless of what went into the bowl first.

There are basically two types of combinations:

1. Combinations without Repetition

C(n, r) = n! / (r! (n - r)!)

2. Combinations with Repetition

C(n + r - 1, r) = (n + r - 1)! / (r! (n - 1)!)



B. Permutations (Order Matters)
A permutation is an arrangement of items where the specific sequence is important. If you change the order of the items, you have a new permutation. 

A simple example would be a door lock code. If the code is 1-2-3, entering 3-2-1 will not open the door. Even though the numbers are the same, the order makes them a distinct "arrangement."

There are basically two types of permutations:

1. Permutations without Repetition

P(n, r) = n! / (n - r)!

2. Permutations with Repetition

n^r (n choices for each of the r positions)






https://www.mathsisfun.com/combinatorics/combinations-permutations.html

https://virtuallearningacademy.net/LessonDisplay/Lesson6243/MATHALGIIBU33Probability_Combinations.pdf

https://www.geeksforgeeks.org/maths/permutations-and-combinations/