About This Notebook¶

This notebook imports the O*NET career training datasets, wrangles them, and then reinserts them into the parent PostgreSQL database.

When completed, the idea is that this notebook is run once, transferring the data in as raw and pristine a fashion as possible to the PostgreSQL database.

If the data in the database is found to be unsatisfactory, the associated database tables would need to be dropped, this file adapted as necessary, and the notebook run again.
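A minimal sketch of that reset, assuming the five onet-schema tables used in this notebook (not executed here):

In [ ]:
# Sketch only: drop the onet tables so the ingest can be re-run from scratch.
from sqlalchemy import create_engine, text

engine = create_engine('postgresql+psycopg2://postgres@localhost/train_reward_compare')
with engine.begin() as conn:  # begin() commits on exit
    for table in ['education_training_experience', 'ete_categories', 'occupation_data',
                  'content_model_reference', 'scales_reference']:
        conn.execute(text(f'DROP TABLE IF EXISTS onet.{table}'))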

Link to Data¶

https://www.onetcenter.org/database.html#all-files

Project Setup¶

Import and Verifications¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sqlalchemy
In [2]:
print(sqlalchemy.__version__)
2.0.48
In [3]:
from sqlalchemy import create_engine, inspect
from sqlalchemy import text
from sqlalchemy.schema import CreateSchema
from sqlalchemy import BigInteger, Column, Integer, String, Float, Boolean
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

Create SQL Engine and Test Connection¶

In [4]:
engine = create_engine('postgresql+psycopg2://postgres@localhost/train_reward_compare')
In [5]:
with engine.connect() as conn:
    result = conn.execute(text("select count(*) from onet.occupation_data"))
    print(result.fetchone())
(1016,)

Import Data from PostgreSQL¶

List of Imports¶

Many of the O*NET dataset tables could be useful. There are too many tables -- and many tables are too large -- for us to perform an exhaustive analysis. We will need to limit our datasets to only those necessary to reach our end goal.

The following datasets seem immediately relevant.

  • education_training_experience
    • this dataset contains information about each occupational type's required education, training, and experience, which is what we desire to analyze
  • ete_categories
    • this dataset provides decoding information about the categories in education, training, and experience
  • occupation_data
    • this dataset provides decoding information about the occupations
  • content_model_reference
    • this dataset provides information about the types of content skills that are needed for a particular occupation
  • scales_reference
    • this dataset provides decoding information about scales

Each of the above datasets will need to be wrangled.

Once wrangled, we will create join statements that link the tables together on the occupation codes, and we will likewise join them to the wage info in the BLS dataset.
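For reference, a minimal sketch of those joins in pandas, using the dataframes loaded below (the BLS wage join is deferred, and any BLS table or key names would be assumptions at this point):

In [ ]:
# Sketch only: the planned joins, once the tables below are loaded.
# ete_categories joins on its composite key; occupation_data on the code.
merged = (
    ete
    .merge(ete_categories, on=['element_id', 'scale_id', 'category'], how='left')
    .merge(od, on='onetsoc_code', how='left')
)
# The BLS wage join would likely use the six-digit SOC prefix of
# onetsoc_code; see the note under onetsoc_code below.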

Import All Datasets¶

In [6]:
# Import each dataset via sqlalchemy into a pandas df
# Adapted from Google search, 'pandas how to read in a table sqlalchemy postgresql'

with engine.connect() as conn:
    ete = pd.read_sql_table('education_training_experience', schema='onet', con=conn)
In [7]:
with engine.connect() as conn:
    ete_categories = pd.read_sql_table('ete_categories', schema='onet', con=conn)
In [8]:
with engine.connect() as conn:
    od = pd.read_sql_table('occupation_data', schema='onet', con=conn)
In [9]:
with engine.connect() as conn:
    cmr = pd.read_sql_table('content_model_reference', schema='onet', con=conn)
In [10]:
with engine.connect() as conn:
    sr = pd.read_sql_table('scales_reference', schema='onet', con=conn)

ete: Education, Training, Experience¶

The ete dataset is our primary target.

We begin by observing each column, looking for NaN values, duplicates, etc.

In [11]:
ete.head()
Out[11]:
onetsoc_code element_id scale_id category data_value n standard_error lower_ci_bound upper_ci_bound recommend_suppress date_updated domain_source
0 11-1011.00 2.D.1 RL 1.0 0.00 28.0 0.0000 NaN NaN N 2023-08-01 Incumbent
1 11-1011.00 2.D.1 RL 2.0 4.46 28.0 4.1428 0.6307 25.5524 N 2023-08-01 Incumbent
2 11-1011.00 2.D.1 RL 3.0 0.00 28.0 0.0000 NaN NaN N 2023-08-01 Incumbent
3 11-1011.00 2.D.1 RL 4.0 0.00 28.0 0.0000 NaN NaN N 2023-08-01 Incumbent
4 11-1011.00 2.D.1 RL 5.0 5.15 28.0 5.2236 0.6000 32.7756 N 2023-08-01 Incumbent
In [12]:
ete.duplicated().sum()
Out[12]:
np.int64(0)

There appear to be no straight duplicates.

In [13]:
ete.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37125 entries, 0 to 37124
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   onetsoc_code        37125 non-null  object        
 1   element_id          37125 non-null  object        
 2   scale_id            37125 non-null  object        
 3   category            35998 non-null  float64       
 4   data_value          37125 non-null  float64       
 5   n                   37125 non-null  float64       
 6   standard_error      28115 non-null  float64       
 7   lower_ci_bound      17736 non-null  float64       
 8   upper_ci_bound      17736 non-null  float64       
 9   recommend_suppress  28115 non-null  object        
 10  date_updated        37125 non-null  datetime64[ns]
 11  domain_source       37125 non-null  object        
dtypes: datetime64[ns](1), float64(6), object(5)
memory usage: 3.4+ MB
In [14]:
ete.shape
Out[14]:
(37125, 12)

onetsoc_code¶

In [15]:
ete['onetsoc_code'].describe()
Out[15]:
count          37125
unique           878
top       11-1011.00
freq              43
Name: onetsoc_code, dtype: object

From the website:

onetsoc_code	Character(10)	O*NET-SOC Code (see Occupation Data)

Joinable to the occupation_data table.

In [16]:
ete['onetsoc_code']
Out[16]:
0        11-1011.00
1        11-1011.00
2        11-1011.00
3        11-1011.00
4        11-1011.00
            ...    
37120    53-7121.00
37121    53-7121.00
37122    53-7121.00
37123    53-7121.00
37124    53-7121.00
Name: onetsoc_code, Length: 37125, dtype: object
In [17]:
ete['onetsoc_code'].isna().sum()
Out[17]:
np.int64(0)

I performed a quick visual inspection of the first 1000 rows and saw nothing else that needs wrangling.

One open question is how the .00 suffix will join with the occupation codes in the BLS table. The first six digits are the standard SOC code; the two-digit suffix is O*NET's own extension, where .00 denotes the SOC occupation itself and nonzero values (e.g., 11-1011.03, Chief Sustainability Officers) denote O*NET specializations. Since BLS data is keyed on the six-digit SOC code, the suffix will need to be stripped or the specializations aggregated.
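If the join ends up using the six-digit SOC code alone, one hypothetical approach is to strip the suffix (the soc_code column name is illustrative):

In [ ]:
# Sketch only: derive '11-1011' from '11-1011.00' for a possible BLS join.
ete['soc_code'] = ete['onetsoc_code'].str.split('.').str[0]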

element_id¶

This column is joinable to the content_model_reference table, which provides the content element (skill, ability, etc.) associated with each element_id value.

In [18]:
ete['element_id']
Out[18]:
0        2.D.1
1        2.D.1
2        2.D.1
3        2.D.1
4        2.D.1
         ...  
37120    3.A.3
37121    3.A.3
37122    3.A.3
37123    3.A.3
37124    3.A.3
Name: element_id, Length: 37125, dtype: object
In [19]:
ete['element_id'].describe()
Out[19]:
count     37125
unique        6
top       2.D.1
freq      10536
Name: element_id, dtype: object
In [20]:
ete['element_id'].isna().sum()
Out[20]:
np.int64(0)

scale_id¶

This joins to the scales_reference table, which decodes each scale_id into a scale name and its minimum and maximum values.

In [21]:
ete['scale_id'].describe()
Out[21]:
count     37125
unique        5
top          RL
freq      10536
Name: scale_id, dtype: object
In [22]:
ete['scale_id'].value_counts()
Out[22]:
scale_id
RL    10536
RW     9658
PT     7902
OJ     7902
IM     1127
Name: count, dtype: int64
In [23]:
ete['scale_id'].isna().sum()
Out[23]:
np.int64(0)

category¶

This joins to the ete_categories table on the composite key (element_id, scale_id, category).

In [24]:
ete['category']
Out[24]:
0        1.0
1        2.0
2        3.0
3        4.0
4        5.0
        ... 
37120    5.0
37121    6.0
37122    7.0
37123    8.0
37124    9.0
Name: category, Length: 37125, dtype: float64
In [25]:
ete['category'].describe()
Out[25]:
count    35998.000000
mean         5.707317
std          3.086156
min          1.000000
25%          3.000000
50%          6.000000
75%          8.000000
max         12.000000
Name: category, dtype: float64
In [26]:
ete['category'].isna().sum()
Out[26]:
np.int64(1127)

These null values are significant for our purposes.

Without a category, these rows may not have useful data for our study.

In [27]:
ete[ete['category'].isna() == True]
Out[27]:
onetsoc_code element_id scale_id category data_value n standard_error lower_ci_bound upper_ci_bound recommend_suppress date_updated domain_source
12 11-1011.00 2.D.4.a IM NaN 3.01 28.0 0.2773 2.4454 3.5834 N 2023-08-01 Incumbent
42 11-1011.00 3.A.4.a IM NaN 2.20 27.0 0.3437 1.4926 2.9055 N 2023-08-01 Incumbent
55 11-1011.03 2.D.4.a IM NaN 2.70 27.0 NaN NaN NaN None 2021-08-01 Occupational Expert
85 11-1011.03 3.A.4.a IM NaN 1.81 27.0 NaN NaN NaN None 2021-08-01 Occupational Expert
98 11-1021.00 2.D.4.a IM NaN 1.88 28.0 0.2508 1.3704 2.3996 N 2023-08-01 Incumbent
... ... ... ... ... ... ... ... ... ... ... ... ...
36915 53-7065.00 3.A.4.a IM NaN 1.26 61.0 0.1361 1.0000 1.5365 N 2022-08-01 Incumbent
36969 53-7072.00 2.D.4.a IM NaN 1.81 19.0 0.4649 1.0000 2.7914 N 2021-08-01 Incumbent
36999 53-7072.00 3.A.4.a IM NaN 1.97 19.0 0.2854 1.3720 2.5714 N 2021-08-01 Incumbent
37053 53-7081.00 2.D.4.a IM NaN 3.07 20.0 0.2438 2.5557 3.5762 N 2023-08-01 Incumbent
37083 53-7081.00 3.A.4.a IM NaN 2.06 20.0 0.1651 1.7117 2.4028 N 2023-08-01 Incumbent

1127 rows × 12 columns

We may drop these at a later time. Notably, every null-category row displayed above has scale_id IM, and the IM count in the value_counts earlier (1127) exactly matches the null count here. This suggests the IM (Importance) scale simply carries no categories, rather than the data being incomplete.
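A quick check of that hypothesis:

In [ ]:
# Sketch only: if the null categories are exactly the IM rows, this should
# return a single entry, IM: 1127.
ete.loc[ete['category'].isna(), 'scale_id'].value_counts()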

data_value¶

From the website:

Rating associated with the O*NET-SOC occupation

An AI Overview provided by Google states:

The O*NET-SOC data_value rating indicates the importance or level of specific occupational descriptors (such as skills, abilities, or work context) for a particular job, generally ranging from "Not Important" (1) to "Extremely Important" (5). These ratings, often provided as a mean extent (EX) or content (CX) value, help quantify the requirements and characteristics of over 1,000 occupations. 
Key details regarding O*NET-SOC ratings include:

  • Scale IDs (scale_id): Ratings are categorized by scale types. EX indicates a Mean Extent rating (degree of importance), while CX denotes a Content Context rating.
  • Rating Significance (data_value): These values show how necessary a skill or activity is, typically on a scale from 1 (lowest) to 5 (highest).
  • Data Source: These ratings are derived from surveys of job incumbents and occupational experts.
  • Application: These metrics allow for detailed comparisons between occupations in the O*NET Database.

In this context, however, the values are shown (below) to range from 0 to 100, as opposed to a max of 5.

In [28]:
ete['data_value']
Out[28]:
0         0.00
1         4.46
2         0.00
3         0.00
4         5.15
         ...  
37120    27.84
37121    12.27
37122     1.21
37123     1.57
37124     0.00
Name: data_value, Length: 37125, dtype: float64
In [29]:
ete['data_value'].describe()
Out[29]:
count    37125.000000
mean         9.539695
std         13.901753
min          0.000000
25%          0.000000
50%          3.570000
75%         14.290000
max        100.000000
Name: data_value, dtype: float64
In [30]:
ete['data_value'].isna().sum()
Out[30]:
np.int64(0)
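The 0-to-100 range suggests these values may be percentage distributions across the categories of each scale. If so, data_value should sum to roughly 100 within each occupation/element/scale group; a quick check of that hypothesis (excluding the category-less IM rows):

In [ ]:
# Sketch only: if data_value is a percentage distribution, group sums
# should cluster near 100.
(
    ete[ete['category'].notna()]
    .groupby(['onetsoc_code', 'element_id', 'scale_id'])['data_value']
    .sum()
    .describe()
)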

n¶

Sample size for each row/occupation category.

In [31]:
ete['n']
Out[31]:
0        28.0
1        28.0
2        28.0
3        28.0
4        28.0
         ... 
37120    27.0
37121    27.0
37122    27.0
37123    27.0
37124    27.0
Name: n, Length: 37125, dtype: float64
In [32]:
ete['n'].isna().sum()
Out[32]:
np.int64(0)
In [33]:
ete['n'].describe()
Out[33]:
count    37125.000000
mean        25.171152
std          9.474034
min          9.000000
25%         20.000000
50%         23.000000
75%         28.000000
max         98.000000
Name: n, dtype: float64

standard_error¶

In [34]:
ete['standard_error'].describe()
Out[34]:
count    28115.000000
mean         4.704188
std          5.432629
min          0.000000
25%          0.000000
50%          2.311300
75%          8.669050
max         29.734500
Name: standard_error, dtype: float64

lower_ci_bound¶

In [35]:
ete['lower_ci_bound'].describe()
Out[35]:
count    17736.000000
mean         5.344971
std          8.259175
min          0.001200
25%          0.720125
50%          2.181450
75%          6.349525
max         88.527500
Name: lower_ci_bound, dtype: float64

upper_ci_bound¶

In [36]:
ete['upper_ci_bound'].describe()
Out[36]:
count    17736.000000
mean        38.158627
std         22.817287
min          0.103800
25%         19.622000
50%         38.295200
75%         54.458500
max         99.642700
Name: upper_ci_bound, dtype: float64

recommend_suppress¶

Low precision indicator (Y=yes, N=no)

Rows flagged Y should probably be dropped from the dataset.

In [37]:
ete['recommend_suppress'].describe()
Out[37]:
count     28115
unique        2
top           N
freq      27249
Name: recommend_suppress, dtype: object
In [38]:
ete['recommend_suppress'].value_counts()
Out[38]:
recommend_suppress
N    27249
Y      866
Name: count, dtype: int64
In [39]:
ete['recommend_suppress'].isna().sum()
Out[39]:
np.int64(9010)
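Incidentally, the 9010 nulls here match the Occupational Expert row count under domain_source below, suggesting the flag is only populated for Incumbent-sourced rows. When we filter, a != 'Y' comparison keeps those nulls; a minimal sketch:

In [ ]:
# Sketch only: drop rows flagged for suppression. NaN != 'Y' evaluates
# True, so the unflagged Occupational Expert rows are retained.
ete_unsuppressed = ete[ete['recommend_suppress'] != 'Y']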

date_updated¶

This data may be of little value for our purposes.

In [40]:
ete['date_updated'].describe()
Out[40]:
count                            37125
mean     2020-04-11 18:41:09.818181632
min                2004-12-01 00:00:00
25%                2017-07-01 00:00:00
50%                2020-08-01 00:00:00
75%                2023-08-01 00:00:00
max                2025-08-01 00:00:00
Name: date_updated, dtype: object

domain_source¶

This data may have little value for our purposes.

In [41]:
ete['domain_source'].describe()
Out[41]:
count         37125
unique            2
top       Incumbent
freq          28115
Name: domain_source, dtype: object
In [42]:
ete['domain_source'].value_counts()
Out[42]:
domain_source
Incumbent              28115
Occupational Expert     9010
Name: count, dtype: int64

ete_categories¶

In [43]:
ete_categories.head()
Out[43]:
element_id scale_id category category_description
0 2.D.1 RL 1.0 Less than a High School Diploma
1 2.D.1 RL 2.0 High School Diploma - or the equivalent (for e...
2 2.D.1 RL 3.0 Post-Secondary Certificate - awarded for train...
3 2.D.1 RL 4.0 Some College Courses
4 2.D.1 RL 5.0 Associate's Degree (or other 2-year degree)
In [44]:
ete_categories.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   element_id            41 non-null     object 
 1   scale_id              41 non-null     object 
 2   category              41 non-null     float64
 3   category_description  41 non-null     object 
dtypes: float64(1), object(3)
memory usage: 1.4+ KB
In [45]:
ete_categories.duplicated().sum()
Out[45]:
np.int64(0)

Observations¶

This dataset is simply a decoder and needs little wrangling. There are no null values or duplicates.

occupation_data¶

In [46]:
od.head()
Out[46]:
onetsoc_code title description
0 11-1011.00 Chief Executives Determine and formulate policies and provide o...
1 11-1011.03 Chief Sustainability Officers Communicate and coordinate with management, sh...
2 11-1021.00 General and Operations Managers Plan, direct, or coordinate the operations of ...
3 11-1031.00 Legislators Develop, introduce, or enact laws and statutes...
4 11-2011.00 Advertising and Promotions Managers Plan, direct, or coordinate advertising polici...
In [47]:
od.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016 entries, 0 to 1015
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   onetsoc_code  1016 non-null   object
 1   title         1016 non-null   object
 2   description   1016 non-null   object
dtypes: object(3)
memory usage: 23.9+ KB

Observations¶

Once again, this dataset is a decoder, providing meaning to onetsoc_code codes such as 11-1011.00, and does not need wrangling.

content_model_reference¶

In [48]:
cmr.head()
Out[48]:
element_id element_name description
0 1 Worker Characteristics Worker Characteristics
1 1.A Abilities Enduring attributes of the individual that inf...
2 1.A.1 Cognitive Abilities Abilities that influence the acquisition and a...
3 1.A.1.a Verbal Abilities Abilities that influence the acquisition and a...
4 1.A.1.a.1 Oral Comprehension The ability to listen to and understand inform...

Observations¶

This is another decoder dataset, providing meaning to element_id codes such as 1.A.1.a.1. The cmr dataset does not need wrangling.

scales_reference¶

In [49]:
sr.head()
Out[49]:
scale_id scale_name minimum maximum
0 AO Automation 1.0 5.0
1 CF Frequency 1.0 5.0
2 CN Amount of Contact 1.0 5.0
3 CT Context 1.0 3.0
4 CTP Context (Categories 1-3) 0.0 100.0

The scales_reference dataset is yet another decoder, providing meaning to the scale_id codes in the ete dataset.

Final Observations¶

The O*NET datasets appear to be much cleaner at this stage than the BLS dataset.

We did identify 1127 null rows in ete['category']. Because the category column is paramount for comparing Education, Training, and Experience against the financial rewards found in the BLS dataset, we likely should drop these rows.

However, before dropping these rows, we should wait until the data comparison project has been assembled and begun. There is a possibility that they carry information relevant to related rows through other keys, such as scale_id. Once we are certain these rows are not needed, we will go ahead and drop them.

In [ ]: