6 Appendix

6.1 Description of the variables

In this section, we are providing a description of all the columns of the original dataset before the data-cleaning part was done.

YearStart and YearEnd: YearStart represents the start, and YearEnd represents the end of the period for collecting information on adult diet, physical activity, and weight status.

LocationAbbr: Identifies the Abbreviations of specific geographic locations.

LocationDesc: Provides the full names of geographic locations corresponding to the abbreviations in the "LocationAbbr" column.

Datasource: Specifies that the data originates from the Behavioral Risk Factor Surveillance System.

Class:

Description: Categorizes the data into classes related to health and lifestyle
Values: “Physical Activity,” “Obesity / Weight Status,” “Fruits and Vegetables.”
Context: Classifies the data based on health-related categories, providing insights into physical activity, obesity, and nutrition related to fruits and vegetables.

Topic (Same as “Class” Column): The ‘Topic’ column duplicates the information found in the ‘Class’ column.

Question: The ‘Question’ column encompasses a diverse array of health-related inquiries and measurements for adults, providing a comprehensive overview of various aspects such as physical activity, weight status, and dietary habits.

Data_Value_Unit: This column appears to lack information on the unit of measurement, as it consists solely of missing values (NA).

Data_Value_Type: Indicating that the column exclusively contains the actual data values, denoted by the term “Value.”

Data_Value and Data_Value_Alt: Both columns represent the same data, representing diverse data points. The values range from 0.9 to 77.6 and include missing values (NA). These columns signify quantitative measurements as a percentage that answers the questions given to the cohort. Data_Value_Footnote_Symbol and Data_Value_Footnote: These columns appear to lack information, as they consist solely of missing values (NA)

Data_Value_Footnote_Symbol and Data_Value_Footnote: These columns appear to lack information, as they consist solely of missing values (NA)

Low_Confidence_Limit: The column includes various numeric values, along with NA (indicating missing values) representing the lower limits of confidence intervals.

High_Confidence_Limit: The column includes various numeric values, along with NA (indicating missing values) representing the upper limits of confidence intervals

Sample_Size: The sample size indicates the number of observations for each corresponding data entry.

Total: This column represents the overall value without filtration by Race, education, etc.

Age(years): This column categorizes individuals into age groups, ranging from “18 - 24” to “65 or older.” “NA” values denote missing or undefined entries, valid for age-specific analysis and comparisons.

Education: This column captures the educational attainment of individuals, ranging from “High school graduate” to “College graduate,” with “NA” values indicating missing or undefined entries. It provides insights into the educational background, enabling analysis based on different education levels.

Gender: This column denotes the gender of individuals, with categories including “Female” and “Male.” The “NA” values indicate missing or undefined gender entries.

Income: This column represents the income levels of individuals, ranging from “Less than $15,000” to “$75,000 or greater.” The “NA” values indicate missing or undefined income entries. Additionally, there is a category for “Data not reported.”

Race/Ethnicity:

Description: This column categorizes individuals into different racial or ethnic groups, providing insights into the diversity of the dataset. The “NA” values indicate missing or undefined entries for race/ethnicity.
Values:
- “Hispanic”: Individuals of Hispanic ethnicity.
- “American Indian/Alaska Native”: Individuals belonging to the American Indian or Alaska Native ethnic group.
- “Asian”: Individuals of Asian ethnicity.
- “Non-Hispanic White”: Individuals of non-Hispanic White ethnicity.
- “Non-Hispanic Black”: Individuals of non-Hispanic Black ethnicity.
- “Hawaiian/Pacific Islander”: Individuals belonging to the Hawaiian/Pacific Islander ethnic group.
- “2 or more races”: Individuals identifying with two or more racial categories.
- “Other”: Individuals belonging to other ethnic or racial categories.
- NA: Missing or undefined race/ethnicity information.

GeoLocation: This column provides geographical coordinates in the format (latitude, longitude), enabling spatial analysis and insights into data distribution across different locations. “NA” values indicate the location of the USA. A total of 106 unique locations are present in the dataset.

ClassID and TopicID: The “ClassID” and “TopicID” columns both represent different categories, with “PA” corresponding to “PA1,” “OWS” to “OWS1,” and “FV” to “FV1.” Each column has three unique identifiers. These columns offer similar information, potentially serving as alternate representations.

QuestionID: The “QuestionID” column serves as a unique identifier for each question in the “Question” column.

DataValueTypeID: The “DataValueTypeID” column contains a unique value, “VALUE”.

LocationID: The “LocationID” column assigns a unique identifier to each location in the “LocationDesc” column.

StratificationCategory1: The “StratificationCategory1” column categorizes data based on various demographic factors, including “Race/Ethnicity,” “Education,” “Income,” “Age (years),” “Gender,” and “Total.” The “NA” values indicate missing or undefined entries in this categorization.

Stratification1: This column provides a detailed breakdown of the data based on various demographic factors, allowing for nuanced analysis and comparisons across different population segments.

StratificationCategoryId1: The “StratificationCategoryId1” column provides unique identifiers for the categories in the “StratificationCategory1” column.

StratificationID1: This column assigns a unique ID to each category in the Stratification1 column, enabling efficient referencing and grouping of data based on diverse demographic factors.