About This notebook¶

This is notebook imports BLS Wage datasets and prepares them for insertion into a postgresql database for further analysis.

When completed, the idea is that this notebook is run once, transferring the data in as raw and pristine of a fashion as possible to the postgresql database.

If the data in the database is found to be unsatisfactory, the associated database tables would need to be dropped, this file would need to be adapted as necessary, and then run again.

Link to Data¶

https://www.bls.gov/oes/

Project Setup¶

Import and Verifications¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sqlalchemy
In [2]:
print(sqlalchemy.__version__)
2.0.48
In [400]:
from sqlalchemy import create_engine, inspect
from sqlalchemy import text
from sqlalchemy.schema import CreateSchema
from sqlalchemy import BigInteger, Column, Integer, String, Float, Boolean
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

Create SQL Engine and Test Connection¶

In [4]:
engine = create_engine('postgresql+psycopg2://postgres@localhost/train_reward_compare')
In [5]:
with engine.connect() as conn:
    result = conn.execute(text("select count(*) from onet.occupation_data"))
    print(result.fetchone())
(1016,)

Import BLS Files¶

In [6]:
file_nat = '../../raw-data/bls-wage/oesm24nat/national_M2024_dl.xlsx'
file_sta = '../../raw-data/bls-wage/oesm24st/state_M2024_dl.xlsx'
file_met = '../../raw-data/bls-wage/oesm24ma/MSA_M2024_dl.xlsx'
file_bos = '../../raw-data/bls-wage/oesm24ma/BOS_M2024_dl.xlsx'
file_nbs = '../../raw-data/bls-wage/oesm24in4/natsector_M2024_dl.xlsx'
In [7]:
nat = pd.ExcelFile(file_nat)
sta = pd.ExcelFile(file_sta)
met = pd.ExcelFile(file_met)
bos = pd.ExcelFile(file_bos)
nbs = pd.ExcelFile(file_nbs)

Test Data in Each Imported File¶

In [8]:
print(nat.sheet_names)
print(sta.sheet_names)
print(met.sheet_names)
print(bos.sheet_names)
print(nbs.sheet_names)
['national_M2024_dl', 'Field Descriptions', 'UpdateTime', 'Filler']
['state_M2024_dl', 'Field Descriptions', 'UpdateTime', 'Filler']
['MSA_M2024_dl', 'Field Descriptions', 'UpdateTime', 'Filler']
['BOS_M2024_dl', 'Field Descriptions', 'UpdateTime', 'Filler']
['Natsector_M2024_dl', 'Field Descriptions', 'UpdateTime', 'Filler']

Rough Wrangling¶

National Data¶

Rough Wrangle National BLS Data¶

In [9]:
nat.parse('national_M2024_dl').head()
Out[9]:
AREA AREA_TITLE AREA_TYPE PRIM_STATE NAICS NAICS_TITLE I_GROUP OWN_CODE OCC_CODE OCC_TITLE ... H_MEDIAN H_PCT75 H_PCT90 A_PCT10 A_PCT25 A_MEDIAN A_PCT75 A_PCT90 ANNUAL HOURLY
0 99 U.S. 1 US 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 23.8 37.89 60.44 29990 36730 49500 78810 125720 NaN NaN
1 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 58.7 82.5 # 57010 79900 122090 171610 # NaN NaN
2 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1000 Top Executives ... 50.48 81.01 # 47510 68800 104990 168490 # NaN NaN
3 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1010 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
4 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN

5 rows × 32 columns

In [10]:
nat_data = nat.parse('national_M2024_dl')
In [11]:
nat_data.columns = nat_data.columns.str.lower()
In [12]:
nat_data.head()
Out[12]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 99 U.S. 1 US 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 23.8 37.89 60.44 29990 36730 49500 78810 125720 NaN NaN
1 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 58.7 82.5 # 57010 79900 122090 171610 # NaN NaN
2 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1000 Top Executives ... 50.48 81.01 # 47510 68800 104990 168490 # NaN NaN
3 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1010 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
4 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN

5 rows × 32 columns

Rough Wrangle National Field Descriptions¶

In [13]:
nat_desc_pre = nat.parse('Field Descriptions')
In [14]:
nat_desc_pre.iloc[:15,:]
Out[14]:
May 2024 OEWS Estimates Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 Occupational Employment and Wage Statistics (O... NaN NaN
2 Bureau of Labor Statistics, Department of Labor NaN NaN
3 website: www.bls.gov/oes NaN NaN
4 email: oewsinfo@bls.gov NaN NaN
5 NaN NaN NaN
6 Not all fields are available for every type of... NaN NaN
7 NaN NaN NaN
8 Field Field Description NaN
9 area U.S. (99), state FIPS code, Metropolitan Stati... NaN
10 area_title Area name NaN
11 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor... NaN
12 prim_state The primary state for the given area. "US" is ... NaN
13 naics North American Industry Classification System ... NaN
14 naics_title North American Industry Classification System ... NaN
In [15]:
nat_desc_pre = nat_desc_pre.drop('Unnamed: 2', axis=1)
In [16]:
drop_index = nat_desc_pre.index[0:8]
In [17]:
nat_desc_pre_dropped = nat_desc_pre.drop(drop_index)
In [18]:
nat_desc_pre_dropped.head()
Out[18]:
May 2024 OEWS Estimates Unnamed: 1
8 Field Field Description
9 area U.S. (99), state FIPS code, Metropolitan Stati...
10 area_title Area name
11 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
12 prim_state The primary state for the given area. "US" is ...
In [19]:
nat_desc_pre_new_header = nat_desc_pre_dropped.iloc[0]
In [20]:
nat_desc_pre_new = nat_desc_pre_dropped[1:]
In [21]:
nat_desc_pre_new.columns = nat_desc_pre_new_header
In [22]:
nat_desc_pre_new = nat_desc_pre_new.reset_index(drop=True)
In [23]:
nat_desc_pre_new.index.name = None
In [24]:
nat_desc_pre_new = nat_desc_pre_new.rename_axis(None, axis=1)
In [25]:
nat_desc_pre_new.head()
Out[25]:
Field Field Description
0 area U.S. (99), state FIPS code, Metropolitan Stati...
1 area_title Area name
2 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
3 prim_state The primary state for the given area. "US" is ...
4 naics North American Industry Classification System ...
In [26]:
nat_desc_pre_new.tail(10)
Out[26]:
Field Field Description
28 a_pct75 Annual 75th percentile wage
29 a_pct90 Annual 90th percentile wage
30 annual Contains "TRUE" if only annual wages are relea...
31 hourly Contains "TRUE" if only hourly wages are relea...
32 NaN NaN
33 Notes: NaN
34 * = indicates that a wage estimate is not ava... NaN
35 ** = indicates that an employment estimate is... NaN
36 # = indicates a wage equal to or greater than... NaN
37 ~ =indicates that the percent of establishment... NaN

Split Descriptions and Notes¶

In [27]:
notes_df = nat_desc_pre_new.iloc[32:,:]
In [28]:
nat_desc = nat_desc_pre_new.iloc[:32, :]
In [29]:
nat_desc.loc[:,'Field'].str.startswith(' ')
Out[29]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: Field, dtype: bool
In [30]:
nat_desc['Field'].str.endswith(' ')
Out[30]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30     True
31     True
Name: Field, dtype: bool
In [31]:
nat_desc.loc[:,'Field'] = nat_desc.loc[:,'Field'].str.strip()
In [32]:
nat_desc.loc[:,'Field'].str.startswith(' ')
Out[32]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: Field, dtype: bool
In [33]:
nat_desc['Field'].str.endswith(' ')
Out[33]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: Field, dtype: bool
In [34]:
nat_desc
Out[34]:
Field Field Description
0 area U.S. (99), state FIPS code, Metropolitan Stati...
1 area_title Area name
2 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
3 prim_state The primary state for the given area. "US" is ...
4 naics North American Industry Classification System ...
5 naics_title North American Industry Classification System ...
6 i_group Industry level. Indicates cross-industry or NA...
7 own_code Ownership type: 1= Federal Government; 2= Stat...
8 occ_code The 6-digit Standard Occupational Classificati...
9 occ_title SOC title or OEWS-specific title for the occup...
10 o_group SOC occupation level. For most occupations, th...
11 tot_emp Estimated total employment rounded to the near...
12 emp_prse Percent relative standard error (PRSE) for the...
13 jobs_1000 The number of jobs (employment) in the given o...
14 loc quotient The location quotient represents the ratio of ...
15 pct_total Percent of industry employment in the given oc...
16 pct_rpt Percent of establishments reporting the given ...
17 h_mean Mean hourly wage
18 a_mean Mean annual wage
19 mean_prse Percent relative standard error (PRSE) for the...
20 h_pct10 Hourly 10th percentile wage
21 h_pct25 Hourly 25th percentile wage
22 h_median Hourly median wage (or the 50th percentile)
23 h_pct75 Hourly 75th percentile wage
24 h_pct90 Hourly 90th percentile wage
25 a_pct10 Annual 10th percentile wage
26 a_pct25 Annual 25th percentile wage
27 a_median Annual median wage (or the 50th percentile)
28 a_pct75 Annual 75th percentile wage
29 a_pct90 Annual 90th percentile wage
30 annual Contains "TRUE" if only annual wages are relea...
31 hourly Contains "TRUE" if only hourly wages are relea...

Storing Notes in DataFrame¶

In [35]:
notes_df = notes_df.drop(columns=['Field Description'], axis=1)
In [36]:
notes_df = notes_df.iloc[2:,:]
In [37]:
notes_df = notes_df.rename(columns={'Field':'Notes'}) 
In [38]:
notes_df
Out[38]:
Notes
34 * = indicates that a wage estimate is not ava...
35 ** = indicates that an employment estimate is...
36 # = indicates a wage equal to or greater than...
37 ~ =indicates that the percent of establishment...

State Data¶

Rough Wrangle State Data¶

In [39]:
sta.parse('state_M2024_dl').head()
Out[39]:
AREA AREA_TITLE AREA_TYPE PRIM_STATE NAICS NAICS_TITLE I_GROUP OWN_CODE OCC_CODE OCC_TITLE ... H_MEDIAN H_PCT75 H_PCT90 A_PCT10 A_PCT25 A_MEDIAN A_PCT75 A_PCT90 ANNUAL HOURLY
0 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 21.07 30.82 47.51 23520 30660 43830 64110 98810 NaN NaN
1 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 48.39 68.5 98.03 51100 72870 100640 142480 203900 NaN NaN
2 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 79.04 106.69 # 104950 130950 164400 221910 # NaN NaN
3 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-1021 General and Operations Managers ... 51.12 78.26 # 50410 74720 106330 162780 # NaN NaN
4 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-1031 Legislators ... * * * 18270 20950 26990 41760 63900 True NaN

5 rows × 32 columns

In [40]:
sta_data = sta.parse('state_M2024_dl')
In [41]:
sta_data.columns = sta_data.columns.str.lower()
In [42]:
sta_data.head()
Out[42]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 21.07 30.82 47.51 23520 30660 43830 64110 98810 NaN NaN
1 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 48.39 68.5 98.03 51100 72870 100640 142480 203900 NaN NaN
2 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 79.04 106.69 # 104950 130950 164400 221910 # NaN NaN
3 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-1021 General and Operations Managers ... 51.12 78.26 # 50410 74720 106330 162780 # NaN NaN
4 1 Alabama 2 AL 0 Cross-industry cross-industry 1235 11-1031 Legislators ... * * * 18270 20950 26990 41760 63900 True NaN

5 rows × 32 columns

Rough Wrangle State Field Descriptions¶

In [43]:
sta_desc = sta.parse('Field Descriptions')
In [44]:
sta_desc.iloc[:15,:]
Out[44]:
May 2024 OEWS Estimates Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 Occupational Employment and Wage Statistics (O... NaN NaN
2 Bureau of Labor Statistics, Department of Labor NaN NaN
3 website: www.bls.gov/oes NaN NaN
4 email: oewsinfo@bls.gov NaN NaN
5 NaN NaN NaN
6 Not all fields are available for every type of... NaN NaN
7 NaN NaN NaN
8 Field Field Description NaN
9 area U.S. (99), state FIPS code, Metropolitan Stati... NaN
10 area_title Area name NaN
11 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor... NaN
12 prim_state The primary state for the given area. "US" is ... NaN
13 naics North American Industry Classification System ... NaN
14 naics_title North American Industry Classification System ... NaN
In [45]:
sta_desc = sta_desc.drop('Unnamed: 2', axis=1)
drop_index = sta_desc.index[0:8]
sta_desc_dropped = sta_desc.drop(drop_index)
sta_desc_new_header = sta_desc_dropped.iloc[0]
sta_desc_new = sta_desc_dropped[1:]
sta_desc_new.columns = sta_desc_new_header
sta_desc_new = sta_desc_new.reset_index(drop=True)
sta_desc_new.index.name = None
sta_desc_new = sta_desc_new.rename_axis(None, axis=1)
In [46]:
sta_desc_new.head()
Out[46]:
Field Field Description
0 area U.S. (99), state FIPS code, Metropolitan Stati...
1 area_title Area name
2 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
3 prim_state The primary state for the given area. "US" is ...
4 naics North American Industry Classification System ...

Metropolitan Data¶

Rough Wrangle Metropolitan Data¶

In [47]:
met_data = met.parse('MSA_M2024_dl')
met_data.columns = met_data.columns.str.lower()
In [48]:
met_data.head()
Out[48]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 10180 Abilene, TX 4 TX 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 20.25 29.34 45.3 23120 29640 42120 61020 94220 NaN NaN
1 10180 Abilene, TX 4 TX 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 42.03 61.17 84.3 44710 61250 87420 127230 175350 NaN NaN
2 10180 Abilene, TX 4 TX 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 85.35 # # 99140 107040 177520 # # NaN NaN
3 10180 Abilene, TX 4 TX 0 Cross-industry cross-industry 1235 11-1021 General and Operations Managers ... 37.18 58.64 87.22 35980 50640 77340 121970 181420 NaN NaN
4 10180 Abilene, TX 4 TX 0 Cross-industry cross-industry 1235 11-2021 Marketing Managers ... 50.02 73.86 99.9 58480 82560 104030 153630 207790 NaN NaN

5 rows × 32 columns

Rough Wrangle Metropolitan Field Descriptions¶

In [49]:
met_desc = met.parse('Field Descriptions')
In [50]:
met_desc.iloc[:15,:]
Out[50]:
May 2024 OEWS Estimates Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 Occupational Employment and Wage Statistics (O... NaN NaN
2 Bureau of Labor Statistics, Department of Labor NaN NaN
3 website: www.bls.gov/oes NaN NaN
4 email: oewsinfo@bls.gov NaN NaN
5 NaN NaN NaN
6 Not all fields are available for every type of... NaN NaN
7 NaN NaN NaN
8 Field Field Description NaN
9 area U.S. (99), state FIPS code, Metropolitan Stati... NaN
10 area_title Area name NaN
11 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor... NaN
12 prim_state The primary state for the given area. "US" is ... NaN
13 naics North American Industry Classification System ... NaN
14 naics_title North American Industry Classification System ... NaN
In [51]:
met_desc = met_desc.drop('Unnamed: 2', axis=1)
drop_index = met_desc.index[0:8]
met_desc_dropped = met_desc.drop(drop_index)
met_desc_new_header = met_desc_dropped.iloc[0]
met_desc_new = met_desc_dropped[1:]
met_desc_new.columns = met_desc_new_header
met_desc_new = met_desc_new.reset_index(drop=True)
met_desc_new.index.name = None
met_desc_new = met_desc_new.rename_axis(None, axis=1)
In [52]:
met_desc_new.head()
Out[52]:
Field Field Description
0 area U.S. (99), state FIPS code, Metropolitan Stati...
1 area_title Area name
2 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
3 prim_state The primary state for the given area. "US" is ...
4 naics North American Industry Classification System ...

Balance of State Data¶

Rough Wrangle Balance of State Data¶

In [53]:
bos_data = bos.parse('BOS_M2024_dl')
bos_data.columns = bos_data.columns.str.lower() 
In [54]:
bos_data.head()
Out[54]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 100001 Northwest Alabama nonmetropolitan area 6 AL 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 18.94 25 35.51 22380 29690 39400 52000 73870 NaN NaN
1 100001 Northwest Alabama nonmetropolitan area 6 AL 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 40.54 58.5 80.72 44770 60730 84320 121680 167910 NaN NaN
2 100001 Northwest Alabama nonmetropolitan area 6 AL 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 71.58 85.16 # 114610 121930 148890 177130 # NaN NaN
3 100001 Northwest Alabama nonmetropolitan area 6 AL 0 Cross-industry cross-industry 1235 11-1021 General and Operations Managers ... 46.63 73.99 101.78 42680 63380 96980 153900 211700 NaN NaN
4 100001 Northwest Alabama nonmetropolitan area 6 AL 0 Cross-industry cross-industry 1235 11-1031 Legislators ... * * * 18210 22690 51460 52020 52020 True NaN

5 rows × 32 columns

Rough Wrangle Balance of State Field Descriptions¶

In [55]:
bos_desc = bos.parse('Field Descriptions')
In [56]:
bos_desc.iloc[:15,:]
Out[56]:
May 2024 OEWS Estimates Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 Occupational Employment and Wage Statistics (O... NaN NaN
2 Bureau of Labor Statistics, Department of Labor NaN NaN
3 website: www.bls.gov/oes NaN NaN
4 email: oewsinfo@bls.gov NaN NaN
5 NaN NaN NaN
6 Not all fields are available for every type of... NaN NaN
7 NaN NaN NaN
8 Field Field Description NaN
9 area U.S. (99), state FIPS code, Metropolitan Stati... NaN
10 area_title Area name NaN
11 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor... NaN
12 prim_state The primary state for the given area. "US" is ... NaN
13 naics North American Industry Classification System ... NaN
14 naics_title North American Industry Classification System ... NaN
In [57]:
bos_desc = bos_desc.drop('Unnamed: 2', axis=1)
drop_index = bos_desc.index[0:8]
bos_desc_dropped = bos_desc.drop(drop_index)
bos_desc_new_header = bos_desc_dropped.iloc[0]
bos_desc_new = bos_desc_dropped[1:]
bos_desc_new.columns = bos_desc_new_header
bos_desc_new = bos_desc_new.reset_index(drop=True)
bos_desc_new.index.name = None
bos_desc_new = bos_desc_new.rename_axis(None, axis=1)
In [58]:
bos_desc_new.head()
Out[58]:
Field Field Description
0 area U.S. (99), state FIPS code, Metropolitan Stati...
1 area_title Area name
2 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
3 prim_state The primary state for the given area. "US" is ...
4 naics North American Industry Classification System ...

National Sector Data¶

Rough Wrangle National Sector Data¶

In [59]:
nbs_data = nbs.parse('Natsector_M2024_dl')
nbs_data.columns = nbs_data.columns.str.lower() 
In [60]:
nbs_data.head()
Out[60]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 99 U.S. 1 US 11 Agriculture, Forestry, Fishing and Hunting sector 5 00-0000 All Occupations ... 17.66 22.21 30.23 33280 34680 36720 46200 62880 NaN NaN
1 99 U.S. 1 US 11 Agriculture, Forestry, Fishing and Hunting sector 5 11-0000 Management Occupations ... 46.08 63.4 90.74 41710 67090 95840 131870 188730 NaN NaN
2 99 U.S. 1 US 11 Agriculture, Forestry, Fishing and Hunting sector 5 11-1000 Top Executives ... 40.91 61.6 91.98 32100 57390 85080 128140 191320 NaN NaN
3 99 U.S. 1 US 11 Agriculture, Forestry, Fishing and Hunting sector 5 11-1010 Chief Executives ... 103.99 # # 105020 154280 216290 # # NaN NaN
4 99 U.S. 1 US 11 Agriculture, Forestry, Fishing and Hunting sector 5 11-1011 Chief Executives ... 103.99 # # 105020 154280 216290 # # NaN NaN

5 rows × 32 columns

Rough Wrangle National Sector Field Descriptions¶

In [61]:
nbs_desc = nbs.parse('Field Descriptions')
In [62]:
nbs_desc.iloc[:15,:]
Out[62]:
May 2024 OEWS Estimates Unnamed: 1 Unnamed: 2
0 NaN NaN NaN
1 Occupational Employment and Wage Statistics (O... NaN NaN
2 Bureau of Labor Statistics, Department of Labor NaN NaN
3 website: www.bls.gov/oes NaN NaN
4 email: oewsinfo@bls.gov NaN NaN
5 NaN NaN NaN
6 Not all fields are available for every type of... NaN NaN
7 NaN NaN NaN
8 Field Field Description NaN
9 area U.S. (99), state FIPS code, Metropolitan Stati... NaN
10 area_title Area name NaN
11 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor... NaN
12 prim_state The primary state for the given area. "US" is ... NaN
13 naics North American Industry Classification System ... NaN
14 naics_title North American Industry Classification System ... NaN
In [63]:
nbs_desc = nbs_desc.drop('Unnamed: 2', axis=1)
drop_index = nbs_desc.index[0:8]
nbs_desc_dropped = nbs_desc.drop(drop_index)
nbs_desc_new_header = nbs_desc_dropped.iloc[0]
nbs_desc_new = nbs_desc_dropped[1:]
nbs_desc_new.columns = nbs_desc_new_header
nbs_desc_new = nbs_desc_new.reset_index(drop=True)
nbs_desc_new.index.name = None
nbs_desc_new = nbs_desc_new.rename_axis(None, axis=1)
In [64]:
nbs_desc_new.head()
Out[64]:
Field Field Description
0 area U.S. (99), state FIPS code, Metropolitan Stati...
1 area_title Area name
2 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
3 prim_state The primary state for the given area. "US" is ...
4 naics North American Industry Classification System ...

Rough Wrangle Closing Commentary¶

The Field Description dataframes are ready to be pushed to the sql database. I do want them readily available as a part of the long-term project, so that either I or my users can reference them.

The data for each of the five sets, on the other hand, is a mixed bag. I'm seeing NaN values and # values. I do not know what they mean. It's time to start investigating there.

Full Wrangle¶

National Data¶

Basic Inspection¶

In [65]:
nat_data.dtypes
Out[65]:
area              int64
area_title       object
area_type         int64
prim_state       object
naics             int64
naics_title      object
i_group          object
own_code          int64
occ_code         object
occ_title        object
o_group          object
tot_emp           int64
emp_prse        float64
jobs_1000       float64
loc_quotient    float64
pct_total       float64
pct_rpt         float64
h_mean           object
a_mean           object
mean_prse       float64
h_pct10          object
h_pct25          object
h_median         object
h_pct75          object
h_pct90          object
a_pct10          object
a_pct25          object
a_median         object
a_pct75          object
a_pct90          object
annual           object
hourly           object
dtype: object
In [66]:
nat_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1403 entries, 0 to 1402
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area          1403 non-null   int64  
 1   area_title    1403 non-null   object 
 2   area_type     1403 non-null   int64  
 3   prim_state    1403 non-null   object 
 4   naics         1403 non-null   int64  
 5   naics_title   1403 non-null   object 
 6   i_group       1403 non-null   object 
 7   own_code      1403 non-null   int64  
 8   occ_code      1403 non-null   object 
 9   occ_title     1403 non-null   object 
 10  o_group       1403 non-null   object 
 11  tot_emp       1403 non-null   int64  
 12  emp_prse      1403 non-null   float64
 13  jobs_1000     0 non-null      float64
 14  loc_quotient  0 non-null      float64
 15  pct_total     0 non-null      float64
 16  pct_rpt       0 non-null      float64
 17  h_mean        1403 non-null   object 
 18  a_mean        1403 non-null   object 
 19  mean_prse     1403 non-null   float64
 20  h_pct10       1403 non-null   object 
 21  h_pct25       1403 non-null   object 
 22  h_median      1403 non-null   object 
 23  h_pct75       1403 non-null   object 
 24  h_pct90       1403 non-null   object 
 25  a_pct10       1403 non-null   object 
 26  a_pct25       1403 non-null   object 
 27  a_median      1403 non-null   object 
 28  a_pct75       1403 non-null   object 
 29  a_pct90       1403 non-null   object 
 30  annual        83 non-null     object 
 31  hourly        7 non-null      object 
dtypes: float64(6), int64(5), object(21)
memory usage: 350.9+ KB

Removing Blank Columns¶

We see that there are four entirely blank columns.

Let's check their description.

In [67]:
with pd.option_context('display.max_colwidth', None):
    print(nat_desc.iloc[13:17,:])
           Field  \
13     jobs_1000   
14  loc quotient   
15     pct_total   
16       pct_rpt   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Field Description  
13                                                                                                                                                                                                                                                                                                                          The number of jobs (employment) in the given occupation per 1,000 jobs in the given area. Only available for the state and MSA estimates; otherwise, this column is blank.   
14  The location quotient represents the ratio of an occupation’s share of employment in a given area to that occupation’s share of employment in the U.S. as a whole. For example, an occupation that makes up 10 percent of employment in a specific metropolitan area compared with 2 percent of U.S. employment would have a location quotient of 5 for the area in question. Only available for the state, metropolitan area, and nonmetropolitan area estimates; otherwise, this column is blank.  
15                                                                                                                                                                                                                                Percent of industry employment in the given occupation. Percents may not sum to 100 because the totals may include data for occupations that could not be published separately. Only available for the national industry estimates; otherwise, this column is blank.   
16                                                                                                                                                                                                                                                                                                                                         Percent of establishments reporting the given occupation for the cell. Only available for the national industry estimates; otherwise, this column is blank.   

Confirm Necessity of Deletion¶

These four columns can be dropped from the nat_data df.

Add To Do List¶

Data¶
  • Drop the four columns that have all blanks
Descriptions¶
  • Drop the rows to match the dropped columns in the data
Remote Columns from National Data¶
In [68]:
drop_nat_data_list = ['jobs_1000', 'loc_quotient', 'pct_total', 'pct_rpt']
In [69]:
nat_data = nat_data.drop(drop_nat_data_list, axis=1)
In [70]:
nat_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1403 entries, 0 to 1402
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   area         1403 non-null   int64  
 1   area_title   1403 non-null   object 
 2   area_type    1403 non-null   int64  
 3   prim_state   1403 non-null   object 
 4   naics        1403 non-null   int64  
 5   naics_title  1403 non-null   object 
 6   i_group      1403 non-null   object 
 7   own_code     1403 non-null   int64  
 8   occ_code     1403 non-null   object 
 9   occ_title    1403 non-null   object 
 10  o_group      1403 non-null   object 
 11  tot_emp      1403 non-null   int64  
 12  emp_prse     1403 non-null   float64
 13  h_mean       1403 non-null   object 
 14  a_mean       1403 non-null   object 
 15  mean_prse    1403 non-null   float64
 16  h_pct10      1403 non-null   object 
 17  h_pct25      1403 non-null   object 
 18  h_median     1403 non-null   object 
 19  h_pct75      1403 non-null   object 
 20  h_pct90      1403 non-null   object 
 21  a_pct10      1403 non-null   object 
 22  a_pct25      1403 non-null   object 
 23  a_median     1403 non-null   object 
 24  a_pct75      1403 non-null   object 
 25  a_pct90      1403 non-null   object 
 26  annual       83 non-null     object 
 27  hourly       7 non-null      object 
dtypes: float64(2), int64(5), object(21)
memory usage: 307.0+ KB
Remote Rows from National Descriptions¶
In [71]:
nat_desc = nat_desc[nat_desc['Field'].isin(drop_nat_data_list) == False]
In [72]:
nat_desc
Out[72]:
Field Field Description
0 area U.S. (99), state FIPS code, Metropolitan Stati...
1 area_title Area name
2 area_type Area type: 1= U.S.; 2= State; 3= U.S. Territor...
3 prim_state The primary state for the given area. "US" is ...
4 naics North American Industry Classification System ...
5 naics_title North American Industry Classification System ...
6 i_group Industry level. Indicates cross-industry or NA...
7 own_code Ownership type: 1= Federal Government; 2= Stat...
8 occ_code The 6-digit Standard Occupational Classificati...
9 occ_title SOC title or OEWS-specific title for the occup...
10 o_group SOC occupation level. For most occupations, th...
11 tot_emp Estimated total employment rounded to the near...
12 emp_prse Percent relative standard error (PRSE) for the...
14 loc quotient The location quotient represents the ratio of ...
17 h_mean Mean hourly wage
18 a_mean Mean annual wage
19 mean_prse Percent relative standard error (PRSE) for the...
20 h_pct10 Hourly 10th percentile wage
21 h_pct25 Hourly 25th percentile wage
22 h_median Hourly median wage (or the 50th percentile)
23 h_pct75 Hourly 75th percentile wage
24 h_pct90 Hourly 90th percentile wage
25 a_pct10 Annual 10th percentile wage
26 a_pct25 Annual 25th percentile wage
27 a_median Annual median wage (or the 50th percentile)
28 a_pct75 Annual 75th percentile wage
29 a_pct90 Annual 90th percentile wage
30 annual Contains "TRUE" if only annual wages are relea...
31 hourly Contains "TRUE" if only hourly wages are relea...

Cross Off To Do List¶

Data¶
  • Drop the four columns that have all blanks
Descriptions¶
  • Drop the rows to match the dropped columns in the data

National Data Area-Related Columns Wrangling¶

Dealing with the first four rows, area, area_title, area_type, prim_state.

The first and third row are of type int, while the second and fourths areobject types, suggesting either strings or even multiple types.

All values are non-null.

We need to make sure that the first and third do not contain values outside the scope outline in the description.

We need to inspect the second and fourth for consistency as well.

In [73]:
nat_data['area'].describe()
Out[73]:
count    1403.0
mean       99.0
std         0.0
min        99.0
25%        99.0
50%        99.0
75%        99.0
max        99.0
Name: area, dtype: float64

That's interesting. Apparently, the entire first row is essentially meaningless.

In [74]:
print(nat_desc.loc[nat_desc['Field'] == 'area', 'Field Description'].values[0])
U.S. (99), state FIPS code, Metropolitan Statistical Area (MSA) code, or OEWS-specific nonmetropolitan area code 

This appears to be a column that only has variance in other datasets. The value of this column in this context is limited, but it may be useful elsewhere. Therefore, we will keep it as is, but perhaps discard it when we begin inference.

In [75]:
nat_data['area_title'].head()
Out[75]:
0    U.S.
1    U.S.
2    U.S.
3    U.S.
4    U.S.
Name: area_title, dtype: object
In [76]:
nat_data['area_title'].unique()
Out[76]:
array(['U.S.'], dtype=object)

The lack of variance in the area_title column follows.

In [77]:
nat_data['area_type'].unique()
Out[77]:
array([1])

The results for area_type are identical.

In [78]:
nat_data['prim_state'].unique()
Out[78]:
array(['US'], dtype=object)

The result of this study is that these columns all simply indicate that they are exactly what the title of the dataset would imply: National.

We will keep these for the PostgreSQL insertion at this time.

Industry Classifications¶

Looking at naics, naics_title, and i_group.

In [79]:
nat_data['naics'].unique()
Out[79]:
array([0])
In [80]:
print(nat_desc.loc[nat_desc['Field'] == 'naics', 'Field Description'].values[0])
North American Industry Classification System (NAICS) code for the given industry 
In [81]:
nat_data['naics_title'].unique()
Out[81]:
array(['Cross-industry'], dtype=object)
In [82]:
print(nat_desc.loc[nat_desc['Field'] == 'naics_title', 'Field Description'].values[0])
North American Industry Classification System (NAICS) title for the given industry 
In [83]:
nat_data['i_group'].unique()
Out[83]:
array(['cross-industry'], dtype=object)
In [84]:
print(nat_desc.loc[nat_desc['Field'] == 'i_group', 'Field Description'].values[0])
Industry level. Indicates cross-industry or NAICS sector, 3-digit, 4-digit, 5-digit, or 6-digit industry. For industries that OEWS no longer publishes at the 4-digit NAICS level, the “4-digit” designation indicates the most detailed industry breakdown available: either a standard NAICS 3-digit industry or an OEWS-specific combination of 4-digit industries. Industries that OEWS has aggregated to the 3-digit NAICS level (for example, NAICS 327000) will appear twice, once with the “3-digit” and once with the “4-digit” designation. 
In [85]:
nat_data['own_code'].unique()	
Out[85]:
array([1235])
In [86]:
print(nat_desc.loc[nat_desc['Field'] == 'own_code', 'Field Description'].values[0])
Ownership type: 1= Federal Government; 2= State Government; 3= Local Government; 123= Federal, State, and Local Government; 235=Private, State, and Local Government; 35 = Private and Local Government; 5= Private; 57=Private, Local Government Gambling Establishments (Sector 71), and Local Government Casino Hotels (Sector 72); 58= Private plus State and Local Government Hospitals; 59= Private and Postal Service; 1235= Federal, State, and Local Government and Private Sector 

These columns follow the same pattern as before in terms of variance.

We will keep the values for the PostgreSQL database, even though they will not have value in isolation from other datasets.

occ_code: Standard Occupation Classifications¶

In [87]:
nat_data['occ_code'].unique()
Out[87]:
array(['00-0000', '11-0000', '11-1000', ..., '53-7121', '53-7190',
       '53-7199'], dtype=object)
In [88]:
print(nat_desc.loc[nat_desc['Field'] == 'occ_code', 'Field Description'].values[0])
The 6-digit Standard Occupational Classification (SOC) code or OEWS-specific code for the occupation 
In [89]:
nat_data['occ_code'].describe()
Out[89]:
count        1403
unique       1396
top       31-1120
freq            2
Name: occ_code, dtype: object
In [90]:
nat_data['occ_code'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 1403 entries, 0 to 1402
Series name: occ_code
Non-Null Count  Dtype 
--------------  ----- 
1403 non-null   object
dtypes: object(1)
memory usage: 11.1+ KB
In [91]:
print(type(nat_data['occ_code'][0]))
<class 'str'>
In [92]:
counts = nat_data['occ_code'].value_counts()
In [93]:
counts
Out[93]:
occ_code
31-1120    2
47-4090    2
29-2010    2
51-2090    2
39-7010    2
          ..
27-2012    1
27-2011    1
27-2010    1
27-2000    1
53-7199    1
Name: count, Length: 1396, dtype: int64
In [94]:
threshold = 1
In [95]:
counts_thresh = counts[counts > threshold].index
In [96]:
counts_thresh
Out[96]:
Index(['31-1120', '47-4090', '29-2010', '51-2090', '39-7010', '13-1020',
       '13-2020'],
      dtype='object', name='occ_code')

It seems that there are a few areas that are duplicated.

In [97]:
nat_data[nat_data.groupby('occ_code')['occ_code'].transform('count') > threshold]
Out[97]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
78 99 U.S. 1 US 0 Cross-industry cross-industry 1235 13-1020 Buyers and Purchasing Agents ... 36.37 47.69 61.31 46460 58670 75650 99190 127520 NaN NaN
79 99 U.S. 1 US 0 Cross-industry cross-industry 1235 13-1020 Buyers and Purchasing Agents ... 36.37 47.69 61.31 46460 58670 75650 99190 127520 NaN NaN
111 99 U.S. 1 US 0 Cross-industry cross-industry 1235 13-2020 Property Appraisers and Assessors ... 31.45 43.66 59.02 38480 49310 65420 90810 122760 NaN NaN
112 99 U.S. 1 US 0 Cross-industry cross-industry 1235 13-2020 Property Appraisers and Assessors ... 31.45 43.66 59.02 38480 49310 65420 90810 122760 NaN NaN
573 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-2010 Clinical Laboratory Technologists and Technicians ... 29.75 38.46 47.11 38020 46580 61890 80010 97990 NaN NaN
574 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-2010 Clinical Laboratory Technologists and Technicians ... 29.75 38.46 47.11 38020 46580 61890 80010 97990 NaN NaN
612 99 U.S. 1 US 0 Cross-industry cross-industry 1235 31-1120 Home Health and Personal Care Aides ... 16.78 18.26 21.25 25600 30370 34900 37980 44190 NaN NaN
613 99 U.S. 1 US 0 Cross-industry cross-industry 1235 31-1120 Home Health and Personal Care Aides ... 16.78 18.26 21.25 25600 30370 34900 37980 44190 NaN NaN
779 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-7010 Tour and Travel Guides ... 17.63 22.07 28.81 26890 31250 36660 45910 59930 NaN NaN
780 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-7010 Tour and Travel Guides ... 17.63 22.07 28.81 26890 31250 36660 45910 59930 NaN NaN
1045 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-4090 Miscellaneous Construction and Related Workers ... 23.14 29.31 37.28 35160 39990 48120 60960 77540 NaN NaN
1046 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-4090 Miscellaneous Construction and Related Workers ... 23.14 29.31 37.28 35160 39990 48120 60960 77540 NaN NaN
1163 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-2090 Miscellaneous Assemblers and Fabricators ... 20.29 23.76 29.82 31650 36660 42210 49410 62030 NaN NaN
1164 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-2090 Miscellaneous Assemblers and Fabricators ... 20.29 23.76 29.82 31650 36660 42210 49410 62030 NaN NaN

14 rows × 28 columns

We appear to have found fourteen duplicate rows. These must be cleaned.

Add To Do Item¶

  • Clean data duplicates

Before cleaning them, let's take a moment to understand by looking at the SOC User Guide.

The 2000 SOC classifies workers at four levels of aggregation: 1) major group; 2) minor group; 3) broad occupation; and 4) detailed occupation. All occupations are clustered into one of the following 23 major groups:

Within these major groups are 96 minor groups, 449 broad occupations, and 821 detailed occupations. Occupations with similar skills or work activities are grouped at each of the four levels of hierarchy to facilitate comparisons. For example, "Life, Physical and Social Science Occupations" (19-0000) is divided into four minor groups, "Life Scientists" (19-1000), "Physical Scientists" (19-2000), "Social Scientists and Related Workers" (19-3000), and "Life, Physical and Social Science Technicians" (19-4000). Life Scientists contains broad occupations such as "Agriculture and Food Scientists" (19-1010), and "Biological Scientists" (19-1020). The broad occupation Biological Scientists includes detailed occupations such as "Biochemists and Biophysicists" (19-1021), and "Microbiologists" (19-1022).

How many o_group values do we have before we begin cleaning?

In [98]:
nat_data['o_group'].value_counts()
Out[98]:
o_group
detailed    831
broad       455
minor        94
major        22
total         1
Name: count, dtype: int64

We appear to have less of each group than the number reported in the user guide.

Since this dataset has not yet dropped any rows, we can assume that the dataset originated in this form.

Clean Duplicate Rows¶

In [99]:
nat_data = nat_data.drop_duplicates()
In [100]:
counts = nat_data['occ_code'].value_counts()
In [101]:
counts
Out[101]:
occ_code
31-1120    2
47-4090    2
29-2010    2
51-2090    2
39-7010    2
          ..
27-2012    1
27-2011    1
27-2010    1
27-2000    1
53-7199    1
Name: count, Length: 1396, dtype: int64

They don't seem to be dropped. Perhaps there are hidden rows that are different?

In [102]:
with pd.option_context('display.max_columns', None):
    print(nat_data[nat_data.groupby('occ_code')['occ_code'].transform('count') > threshold])
      area area_title  area_type prim_state  naics     naics_title  \
78      99       U.S.          1         US      0  Cross-industry   
79      99       U.S.          1         US      0  Cross-industry   
111     99       U.S.          1         US      0  Cross-industry   
112     99       U.S.          1         US      0  Cross-industry   
573     99       U.S.          1         US      0  Cross-industry   
574     99       U.S.          1         US      0  Cross-industry   
612     99       U.S.          1         US      0  Cross-industry   
613     99       U.S.          1         US      0  Cross-industry   
779     99       U.S.          1         US      0  Cross-industry   
780     99       U.S.          1         US      0  Cross-industry   
1045    99       U.S.          1         US      0  Cross-industry   
1046    99       U.S.          1         US      0  Cross-industry   
1163    99       U.S.          1         US      0  Cross-industry   
1164    99       U.S.          1         US      0  Cross-industry   

             i_group  own_code occ_code  \
78    cross-industry      1235  13-1020   
79    cross-industry      1235  13-1020   
111   cross-industry      1235  13-2020   
112   cross-industry      1235  13-2020   
573   cross-industry      1235  29-2010   
574   cross-industry      1235  29-2010   
612   cross-industry      1235  31-1120   
613   cross-industry      1235  31-1120   
779   cross-industry      1235  39-7010   
780   cross-industry      1235  39-7010   
1045  cross-industry      1235  47-4090   
1046  cross-industry      1235  47-4090   
1163  cross-industry      1235  51-2090   
1164  cross-industry      1235  51-2090   

                                              occ_title   o_group  tot_emp  \
78                         Buyers and Purchasing Agents     broad   486900   
79                         Buyers and Purchasing Agents  detailed   486900   
111                   Property Appraisers and Assessors     broad    59070   
112                   Property Appraisers and Assessors  detailed    59070   
573   Clinical Laboratory Technologists and Technicians     broad   343040   
574   Clinical Laboratory Technologists and Technicians  detailed   343040   
612                 Home Health and Personal Care Aides     broad  3988140   
613                 Home Health and Personal Care Aides  detailed  3988140   
779                              Tour and Travel Guides     broad    49010   
780                              Tour and Travel Guides  detailed    49010   
1045     Miscellaneous Construction and Related Workers     broad    33530   
1046     Miscellaneous Construction and Related Workers  detailed    33530   
1163           Miscellaneous Assemblers and Fabricators     broad  1457800   
1164           Miscellaneous Assemblers and Fabricators  detailed  1457800   

      emp_prse h_mean a_mean  mean_prse h_pct10 h_pct25 h_median h_pct75  \
78         0.6  39.29  81720        0.4   22.34   28.21    36.37   47.69   
79         0.6  39.29  81720        0.4   22.34   28.21    36.37   47.69   
111        2.7   37.1  77160        2.5    18.5   23.71    31.45   43.66   
112        2.7   37.1  77160        2.5    18.5   23.71    31.45   43.66   
573        1.0  31.41  65320        0.3   18.28   22.39    29.75   38.46   
574        1.0  31.41  65320        0.3   18.28   22.39    29.75   38.46   
612        0.3  16.82  34990        0.3   12.31    14.6    16.78   18.26   
613        0.3  16.82  34990        0.3   12.31    14.6    16.78   18.26   
779        2.7  20.72  43090        3.3   12.93   15.03    17.63   22.07   
780        2.7  20.72  43090        3.3   12.93   15.03    17.63   22.07   
1045       4.8  25.44  52910        1.9    16.9   19.23    23.14   29.31   
1046       4.8  25.44  52910        1.9    16.9   19.23    23.14   29.31   
1163       0.5  21.57  44860        0.2   15.22   17.62    20.29   23.76   
1164       0.5  21.57  44860        0.2   15.22   17.62    20.29   23.76   

     h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly  
78     61.31   46460   58670    75650   99190  127520    NaN    NaN  
79     61.31   46460   58670    75650   99190  127520    NaN    NaN  
111    59.02   38480   49310    65420   90810  122760    NaN    NaN  
112    59.02   38480   49310    65420   90810  122760    NaN    NaN  
573    47.11   38020   46580    61890   80010   97990    NaN    NaN  
574    47.11   38020   46580    61890   80010   97990    NaN    NaN  
612    21.25   25600   30370    34900   37980   44190    NaN    NaN  
613    21.25   25600   30370    34900   37980   44190    NaN    NaN  
779    28.81   26890   31250    36660   45910   59930    NaN    NaN  
780    28.81   26890   31250    36660   45910   59930    NaN    NaN  
1045   37.28   35160   39990    48120   60960   77540    NaN    NaN  
1046   37.28   35160   39990    48120   60960   77540    NaN    NaN  
1163   29.82   31650   36660    42210   49410   62030    NaN    NaN  
1164   29.82   31650   36660    42210   49410   62030    NaN    NaN  

This shows us that the o_group column is providing some type of difference in these columns. Let's explore this column's description.

In [103]:
print(nat_desc.loc[nat_desc['Field'] == 'o_group', 'Field Description'].values[0])
SOC occupation level. For most occupations, this field indicates the standard SOC major, minor, broad, and detailed levels, in addition to all-occupations totals. For occupations that OEWS no longer publishes at the SOC detailed level, the “detailed” designation indicates the most detailed data available: either a standard SOC broad occupation or an OEWS-specific combination of detailed occupations. Occupations that OEWS has aggregated to the SOC broad occupation level will appear in the file twice, once with the “broad” and once with the “detailed” designation.

This informs us that these are rows of data that have been included on purpose twice, due to changes in the way Occupational Employment and Wage Statistics (OEWS) manages the data.

The duplicates that have the detailed values can be dropped, as OEWS has aggregated them into the broad category.

In [104]:
nat_data_filtered = nat_data[(nat_data['occ_code'].isin(counts_thresh)) & (nat_data['o_group'] == 'detailed')]
In [105]:
nat_data_filtered
Out[105]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
79 99 U.S. 1 US 0 Cross-industry cross-industry 1235 13-1020 Buyers and Purchasing Agents ... 36.37 47.69 61.31 46460 58670 75650 99190 127520 NaN NaN
112 99 U.S. 1 US 0 Cross-industry cross-industry 1235 13-2020 Property Appraisers and Assessors ... 31.45 43.66 59.02 38480 49310 65420 90810 122760 NaN NaN
574 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-2010 Clinical Laboratory Technologists and Technicians ... 29.75 38.46 47.11 38020 46580 61890 80010 97990 NaN NaN
613 99 U.S. 1 US 0 Cross-industry cross-industry 1235 31-1120 Home Health and Personal Care Aides ... 16.78 18.26 21.25 25600 30370 34900 37980 44190 NaN NaN
780 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-7010 Tour and Travel Guides ... 17.63 22.07 28.81 26890 31250 36660 45910 59930 NaN NaN
1046 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-4090 Miscellaneous Construction and Related Workers ... 23.14 29.31 37.28 35160 39990 48120 60960 77540 NaN NaN
1164 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-2090 Miscellaneous Assemblers and Fabricators ... 20.29 23.76 29.82 31650 36660 42210 49410 62030 NaN NaN

7 rows × 28 columns

In [106]:
nat_data_filtered['o_group']
Out[106]:
79      detailed
112     detailed
574     detailed
613     detailed
780     detailed
1046    detailed
1164    detailed
Name: o_group, dtype: object
In [107]:
nat_data_pre_drop = nat_data.drop(nat_data_filtered.index)
In [108]:
nat_data_pre_drop.head()
Out[108]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 99 U.S. 1 US 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 23.8 37.89 60.44 29990 36730 49500 78810 125720 NaN NaN
1 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 58.7 82.5 # 57010 79900 122090 171610 # NaN NaN
2 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1000 Top Executives ... 50.48 81.01 # 47510 68800 104990 168490 # NaN NaN
3 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1010 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
4 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN

5 rows × 28 columns

In [109]:
nat_data_pre_drop.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1396 entries, 0 to 1402
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   area         1396 non-null   int64  
 1   area_title   1396 non-null   object 
 2   area_type    1396 non-null   int64  
 3   prim_state   1396 non-null   object 
 4   naics        1396 non-null   int64  
 5   naics_title  1396 non-null   object 
 6   i_group      1396 non-null   object 
 7   own_code     1396 non-null   int64  
 8   occ_code     1396 non-null   object 
 9   occ_title    1396 non-null   object 
 10  o_group      1396 non-null   object 
 11  tot_emp      1396 non-null   int64  
 12  emp_prse     1396 non-null   float64
 13  h_mean       1396 non-null   object 
 14  a_mean       1396 non-null   object 
 15  mean_prse    1396 non-null   float64
 16  h_pct10      1396 non-null   object 
 17  h_pct25      1396 non-null   object 
 18  h_median     1396 non-null   object 
 19  h_pct75      1396 non-null   object 
 20  h_pct90      1396 non-null   object 
 21  a_pct10      1396 non-null   object 
 22  a_pct25      1396 non-null   object 
 23  a_median     1396 non-null   object 
 24  a_pct75      1396 non-null   object 
 25  a_pct90      1396 non-null   object 
 26  annual       83 non-null     object 
 27  hourly       7 non-null      object 
dtypes: float64(2), int64(5), object(21)
memory usage: 316.3+ KB

We see that seven rows have been removed.

In [110]:
nat_data_pre_drop[nat_data_pre_drop.groupby('occ_code')['occ_code'].transform('count') > threshold]
Out[110]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly

0 rows × 28 columns

There are no duplicates.

In [111]:
nat_data = nat_data_pre_drop

Mark To Do Item Complete¶

  • Clean data duplicates

occ_title: SOC Column Cleaning Continuation¶

In [112]:
nat_data['occ_title'].unique()
Out[112]:
array(['All Occupations', 'Management Occupations', 'Top Executives', ...,
       'Tank Car, Truck, and Ship Loaders',
       'Miscellaneous Material Moving Workers',
       'Material Moving Workers, All Other'], dtype=object)
In [113]:
print(nat_desc.loc[nat_desc['Field'] == 'occ_title', 'Field Description'].values[0])
SOC title or OEWS-specific title for the occupation
In [114]:
nat_data['occ_title'].describe()
Out[114]:
count                   1396
unique                  1138
top       Massage Therapists
freq                       2
Name: occ_title, dtype: object
In [115]:
nat_data['occ_title'].info()
<class 'pandas.core.series.Series'>
Index: 1396 entries, 0 to 1402
Series name: occ_title
Non-Null Count  Dtype 
--------------  ----- 
1396 non-null   object
dtypes: object(1)
memory usage: 21.8+ KB
In [116]:
print(type(nat_data['occ_title'][0]))
<class 'str'>
In [117]:
counts = nat_data['occ_title'].value_counts()
In [118]:
counts
Out[118]:
occ_title
Massage Therapists                                 2
Biological Technicians                             2
Couriers and Messengers                            2
Cargo and Freight Agents                           2
Printing Workers                                   2
                                                  ..
Healthcare Diagnosing or Treating Practitioners    1
Dentists                                           1
Dentists, General                                  1
Oral and Maxillofacial Surgeons                    1
Material Moving Workers, All Other                 1
Name: count, Length: 1138, dtype: int64

We seem to have found more duplicate values.

In [119]:
nat_data[nat_data.groupby('occ_title')['occ_title'].transform('count') > threshold]
Out[119]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
3 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1010 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
4 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
5 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1020 General and Operations Managers ... 49.5 78.91 # 47420 67160 102950 164130 # NaN NaN
6 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1021 General and Operations Managers ... 49.5 78.91 # 47420 67160 102950 164130 # NaN NaN
7 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1030 Legislators ... * * * 20380 29120 44810 80350 137820 True NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1386 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7051 Industrial Truck and Tractor Operators ... 22.3 25.81 29.59 36500 39780 46390 53680 61540 NaN NaN
1397 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7080 Refuse and Recyclable Material Collectors ... 23.24 29.33 36.16 31810 38330 48350 61010 75200 NaN NaN
1398 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7081 Refuse and Recyclable Material Collectors ... 23.24 29.33 36.16 31810 38330 48350 61010 75200 NaN NaN
1399 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7120 Tank Car, Truck, and Ship Loaders ... 27.92 34.25 42.36 38260 47260 58070 71230 88120 NaN NaN
1400 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7121 Tank Car, Truck, and Ship Loaders ... 27.92 34.25 42.36 38260 47260 58070 71230 88120 NaN NaN

516 rows × 28 columns

In [120]:
with pd.option_context('display.max_columns', None):
    print(nat_data[nat_data.groupby('occ_title')['occ_title'].transform('count') > threshold].head(2))
   area area_title  area_type prim_state  naics     naics_title  \
3    99       U.S.          1         US      0  Cross-industry   
4    99       U.S.          1         US      0  Cross-industry   

          i_group  own_code occ_code         occ_title   o_group  tot_emp  \
3  cross-industry      1235  11-1010  Chief Executives     broad   211850   
4  cross-industry      1235  11-1011  Chief Executives  detailed   211850   

   emp_prse  h_mean  a_mean  mean_prse h_pct10 h_pct25 h_median h_pct75  \
3       1.2  126.41  262930        0.9   35.44   60.61    99.24       #   
4       1.2  126.41  262930        0.9   35.44   60.61    99.24       #   

  h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly  
3       #   73710  126080   206420       #       #    NaN    NaN  
4       #   73710  126080   206420       #       #    NaN    NaN  

These duplicates seem to have a different value in the occ_code and the o_group code.

Likely, the o_group issue is a continuation of the issue from before, where a formerly detailed dataset has been aggregated into a broad dataset.

The occ_code appears to be raised by one digit lower as a part of the aggregation methodology shift.

Let's consider that the detailed rows can, again, are duplicates and can therefore be dropped.

Add To Do Item¶

  • Remove duplicate rows discovered via the occ_title column.

Clean Duplicate Rows¶

In [121]:
counts_thresh = counts[counts > threshold].index
In [122]:
nat_data_filtered = nat_data[(nat_data['occ_title'].isin(counts_thresh)) & (nat_data['o_group'] == 'detailed')]
In [123]:
nat_data_filtered
Out[123]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
4 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1011 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
6 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1021 General and Operations Managers ... 49.5 78.91 # 47420 67160 102950 164130 # NaN NaN
8 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1031 Legislators ... * * * 20380 29120 44810 80350 137820 True NaN
11 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-2011 Advertising and Promotions Managers ... 61.04 85.85 # 63000 85990 126960 178570 # NaN NaN
23 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-3021 Computer and Information Systems Managers ... 82.31 103.95 # 104450 134350 171200 216220 # NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1382 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7031 Dredge Operators ... 23.28 28.99 36.08 42060 46120 48430 60300 75050 NaN NaN
1384 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7041 Hoist and Winch Operators ... 25.15 43.37 55.83 33910 39220 52310 90200 116120 NaN NaN
1386 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7051 Industrial Truck and Tractor Operators ... 22.3 25.81 29.59 36500 39780 46390 53680 61540 NaN NaN
1398 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7081 Refuse and Recyclable Material Collectors ... 23.24 29.33 36.16 31810 38330 48350 61010 75200 NaN NaN
1400 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7121 Tank Car, Truck, and Ship Loaders ... 27.92 34.25 42.36 38260 47260 58070 71230 88120 NaN NaN

249 rows × 28 columns

In [124]:
nat_data_filtered['o_group'].unique()
Out[124]:
array(['detailed'], dtype=object)

On the one hand, we seem to have made some progress. We have found 249 rows that can be dropped.

However, there were 516 rows originally. We are missing perhaps 9 rows.

While we could write some complicated logic at this stage to try to figure out why those rows are not yet included, we don't have to do so. When we drop the known duplicate rows and rerun the original duplicate-finding logic, we'll see them in isolation.

In [125]:
nat_data_pre_drop = nat_data.drop(nat_data_filtered.index)
In [126]:
nat_data_pre_drop.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1147 entries, 0 to 1402
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   area         1147 non-null   int64  
 1   area_title   1147 non-null   object 
 2   area_type    1147 non-null   int64  
 3   prim_state   1147 non-null   object 
 4   naics        1147 non-null   int64  
 5   naics_title  1147 non-null   object 
 6   i_group      1147 non-null   object 
 7   own_code     1147 non-null   int64  
 8   occ_code     1147 non-null   object 
 9   occ_title    1147 non-null   object 
 10  o_group      1147 non-null   object 
 11  tot_emp      1147 non-null   int64  
 12  emp_prse     1147 non-null   float64
 13  h_mean       1147 non-null   object 
 14  a_mean       1147 non-null   object 
 15  mean_prse    1147 non-null   float64
 16  h_pct10      1147 non-null   object 
 17  h_pct25      1147 non-null   object 
 18  h_median     1147 non-null   object 
 19  h_pct75      1147 non-null   object 
 20  h_pct90      1147 non-null   object 
 21  a_pct10      1147 non-null   object 
 22  a_pct25      1147 non-null   object 
 23  a_median     1147 non-null   object 
 24  a_pct75      1147 non-null   object 
 25  a_pct90      1147 non-null   object 
 26  annual       80 non-null     object 
 27  hourly       7 non-null      object 
dtypes: float64(2), int64(5), object(21)
memory usage: 259.9+ KB

Part 2: Analyzing for Additional Duplicates via occ_title¶

With the first set of duplicates removed, we should now be able to see the remaining duplicates.

In [127]:
counts = nat_data_pre_drop['occ_title'].value_counts()
In [128]:
counts
Out[128]:
occ_title
Supervisors of Food Preparation and Serving Workers    2
Baggage Porters, Bellhops, and Concierges              2
Sales Representatives, Wholesale and Manufacturing     2
Grounds Maintenance Workers                            2
Tour and Travel Guides                                 2
                                                      ..
Athletes, Coaches, Umpires, and Related Workers        1
Athletes and Sports Competitors                        1
Coaches and Scouts                                     1
Umpires, Referees, and Other Sports Officials          1
Material Moving Workers, All Other                     1
Name: count, Length: 1138, dtype: int64
In [129]:
nat_data_pre_drop[nat_data_pre_drop.groupby('occ_title')['occ_title'].transform('count') > threshold]
Out[129]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
304 99 U.S. 1 US 0 Cross-industry cross-industry 1235 19-5000 Occupational Health and Safety Specialists and... ... 37.93 48.79 61.13 47790 59620 78900 101490 127140 NaN NaN
305 99 U.S. 1 US 0 Cross-industry cross-industry 1235 19-5010 Occupational Health and Safety Specialists and... ... 37.93 48.79 61.13 47790 59620 78900 101490 127140 NaN NaN
681 99 U.S. 1 US 0 Cross-industry cross-industry 1235 35-1000 Supervisors of Food Preparation and Serving Wo... ... 21.22 27.4 34.58 29600 35840 44140 56990 71920 NaN NaN
682 99 U.S. 1 US 0 Cross-industry cross-industry 1235 35-1010 Supervisors of Food Preparation and Serving Wo... ... 21.22 27.4 34.58 29600 35840 44140 56990 71920 NaN NaN
725 99 U.S. 1 US 0 Cross-industry cross-industry 1235 37-3000 Grounds Maintenance Workers ... 18.5 22.3 27.14 30140 35470 38470 46390 56460 NaN NaN
726 99 U.S. 1 US 0 Cross-industry cross-industry 1235 37-3010 Grounds Maintenance Workers ... 18.5 22.3 27.14 30140 35470 38470 46390 56460 NaN NaN
774 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-6000 Baggage Porters, Bellhops, and Concierges ... 17.7 20.96 26.13 28360 32560 36810 43600 54340 NaN NaN
775 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-6010 Baggage Porters, Bellhops, and Concierges ... 17.7 20.96 26.13 28360 32560 36810 43600 54340 NaN NaN
778 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-7000 Tour and Travel Guides ... 17.63 22.07 28.81 26890 31250 36660 45910 59930 NaN NaN
779 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-7010 Tour and Travel Guides ... 17.63 22.07 28.81 26890 31250 36660 45910 59930 NaN NaN
816 99 U.S. 1 US 0 Cross-industry cross-industry 1235 41-4000 Sales Representatives, Wholesale and Manufactu... ... 35.63 49.61 74.97 38910 50820 74100 103200 155930 NaN NaN
817 99 U.S. 1 US 0 Cross-industry cross-industry 1235 41-4010 Sales Representatives, Wholesale and Manufactu... ... 35.63 49.61 74.97 38910 50820 74100 103200 155930 NaN NaN
917 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-6000 Secretaries and Administrative Assistants ... 22.82 28.84 36.81 33840 38790 47460 59990 76550 NaN NaN
918 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-6010 Secretaries and Administrative Assistants ... 22.82 28.84 36.81 33840 38790 47460 59990 76550 NaN NaN
1021 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-3000 Helpers, Construction Trades ... 19.44 22.94 27.64 31360 36280 40430 47710 57480 NaN NaN
1022 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-3010 Helpers, Construction Trades ... 19.44 22.94 27.64 31360 36280 40430 47710 57480 NaN NaN
1212 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-5100 Printing Workers ... 21.55 25.24 30.06 31450 36950 44830 52510 62530 NaN NaN
1213 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-5110 Printing Workers ... 21.55 25.24 30.06 31450 36950 44830 52510 62530 NaN NaN

18 rows × 28 columns

There are still 9 duplicates, as expected.

In [130]:
with pd.option_context('display.max_columns', None):
    print(nat_data_pre_drop[nat_data_pre_drop.groupby('occ_title')['occ_title'].transform('count') > threshold])
      area area_title  area_type prim_state  naics     naics_title  \
304     99       U.S.          1         US      0  Cross-industry   
305     99       U.S.          1         US      0  Cross-industry   
681     99       U.S.          1         US      0  Cross-industry   
682     99       U.S.          1         US      0  Cross-industry   
725     99       U.S.          1         US      0  Cross-industry   
726     99       U.S.          1         US      0  Cross-industry   
774     99       U.S.          1         US      0  Cross-industry   
775     99       U.S.          1         US      0  Cross-industry   
778     99       U.S.          1         US      0  Cross-industry   
779     99       U.S.          1         US      0  Cross-industry   
816     99       U.S.          1         US      0  Cross-industry   
817     99       U.S.          1         US      0  Cross-industry   
917     99       U.S.          1         US      0  Cross-industry   
918     99       U.S.          1         US      0  Cross-industry   
1021    99       U.S.          1         US      0  Cross-industry   
1022    99       U.S.          1         US      0  Cross-industry   
1212    99       U.S.          1         US      0  Cross-industry   
1213    99       U.S.          1         US      0  Cross-industry   

             i_group  own_code occ_code  \
304   cross-industry      1235  19-5000   
305   cross-industry      1235  19-5010   
681   cross-industry      1235  35-1000   
682   cross-industry      1235  35-1010   
725   cross-industry      1235  37-3000   
726   cross-industry      1235  37-3010   
774   cross-industry      1235  39-6000   
775   cross-industry      1235  39-6010   
778   cross-industry      1235  39-7000   
779   cross-industry      1235  39-7010   
816   cross-industry      1235  41-4000   
817   cross-industry      1235  41-4010   
917   cross-industry      1235  43-6000   
918   cross-industry      1235  43-6010   
1021  cross-industry      1235  47-3000   
1022  cross-industry      1235  47-3010   
1212  cross-industry      1235  51-5100   
1213  cross-industry      1235  51-5110   

                                              occ_title o_group  tot_emp  \
304   Occupational Health and Safety Specialists and...   minor   159880   
305   Occupational Health and Safety Specialists and...   broad   159880   
681   Supervisors of Food Preparation and Serving Wo...   minor  1369780   
682   Supervisors of Food Preparation and Serving Wo...   broad  1369780   
725                         Grounds Maintenance Workers   minor  1030070   
726                         Grounds Maintenance Workers   broad  1030070   
774           Baggage Porters, Bellhops, and Concierges   minor    75430   
775           Baggage Porters, Bellhops, and Concierges   broad    75430   
778                              Tour and Travel Guides   minor    49010   
779                              Tour and Travel Guides   broad    49010   
816   Sales Representatives, Wholesale and Manufactu...   minor  1560790   
817   Sales Representatives, Wholesale and Manufactu...   broad  1560790   
917           Secretaries and Administrative Assistants   minor  3195900   
918           Secretaries and Administrative Assistants   broad  3195900   
1021                       Helpers, Construction Trades   minor   186250   
1022                       Helpers, Construction Trades   broad   186250   
1212                                   Printing Workers   minor   204640   
1213                                   Printing Workers   broad   204640   

      emp_prse h_mean a_mean  mean_prse h_pct10 h_pct25 h_median h_pct75  \
304        1.5  40.27  83770        0.6   22.98   28.66    37.93   48.79   
305        1.5  40.27  83770        0.6   22.98   28.66    37.93   48.79   
681        0.7  22.85  47540        0.5   14.23   17.23    21.22    27.4   
682        0.7  22.85  47540        0.5   14.23   17.23    21.22    27.4   
725        0.5  20.07  41750        0.3   14.49   17.05     18.5    22.3   
726        0.5  20.07  41750        0.3   14.49   17.05     18.5    22.3   
774        3.3  18.84  39190        0.9   13.63   15.65     17.7   20.96   
775        3.3  18.84  39190        0.9   13.63   15.65     17.7   20.96   
778        2.7  20.72  43090        3.3   12.93   15.03    17.63   22.07   
779        2.7  20.72  43090        3.3   12.93   15.03    17.63   22.07   
816        0.5  42.16  87690        0.5   18.71   24.43    35.63   49.61   
817        0.5  42.16  87690        0.5   18.71   24.43    35.63   49.61   
917        0.7  25.03  52070        0.2   16.27   18.65    22.82   28.84   
918        0.7  25.03  52070        0.2   16.27   18.65    22.82   28.84   
1021       1.6  20.74  43140        0.7   15.08   17.44    19.44   22.94   
1022       1.6  20.74  43140        0.7   15.08   17.44    19.44   22.94   
1212       1.0  22.24  46260        0.3   15.12   17.76    21.55   25.24   
1213       1.0  22.24  46260        0.3   15.12   17.76    21.55   25.24   

     h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly  
304    61.13   47790   59620    78900  101490  127140    NaN    NaN  
305    61.13   47790   59620    78900  101490  127140    NaN    NaN  
681    34.58   29600   35840    44140   56990   71920    NaN    NaN  
682    34.58   29600   35840    44140   56990   71920    NaN    NaN  
725    27.14   30140   35470    38470   46390   56460    NaN    NaN  
726    27.14   30140   35470    38470   46390   56460    NaN    NaN  
774    26.13   28360   32560    36810   43600   54340    NaN    NaN  
775    26.13   28360   32560    36810   43600   54340    NaN    NaN  
778    28.81   26890   31250    36660   45910   59930    NaN    NaN  
779    28.81   26890   31250    36660   45910   59930    NaN    NaN  
816    74.97   38910   50820    74100  103200  155930    NaN    NaN  
817    74.97   38910   50820    74100  103200  155930    NaN    NaN  
917    36.81   33840   38790    47460   59990   76550    NaN    NaN  
918    36.81   33840   38790    47460   59990   76550    NaN    NaN  
1021   27.64   31360   36280    40430   47710   57480    NaN    NaN  
1022   27.64   31360   36280    40430   47710   57480    NaN    NaN  
1212   30.06   31450   36950    44830   52510   62530    NaN    NaN  
1213   30.06   31450   36950    44830   52510   62530    NaN    NaN  

This shows that these duplicates are due to a difference between minor and broad specifications in the o_group column.

In [131]:
print(nat_desc.loc[nat_desc['Field'] == 'o_group', 'Field Description'].values[0])
SOC occupation level. For most occupations, this field indicates the standard SOC major, minor, broad, and detailed levels, in addition to all-occupations totals. For occupations that OEWS no longer publishes at the SOC detailed level, the “detailed” designation indicates the most detailed data available: either a standard SOC broad occupation or an OEWS-specific combination of detailed occupations. Occupations that OEWS has aggregated to the SOC broad occupation level will appear in the file twice, once with the “broad” and once with the “detailed” designation.

The description of this column does not provide information about how to handle these minor/broad duplicates.

They are clearly duplicates, as evidenced by the identical data in nearly all other columns between the matching pairs.

Therefore, at least one duplicate for each needs to be dropped.

Let's revisit the SOC User Guide:

The 2000 SOC classifies workers at four levels of aggregation: 1) major group; 2) minor group; 3) broad occupation; and 4) detailed occupation. All occupations are clustered into one of the following 23 major groups:

Within these major groups are 96 minor groups, 449 broad occupations, and 821 detailed occupations. Occupations with similar skills or work activities are grouped at each of the four levels of hierarchy to facilitate comparisons. For example, "Life, Physical and Social Science Occupations" (19-0000) is divided into four minor groups, "Life Scientists" (19-1000), "Physical Scientists" (19-2000), "Social Scientists and Related Workers" (19-3000), and "Life, Physical and Social Science Technicians" (19-4000). Life Scientists contains broad occupations such as "Agriculture and Food Scientists" (19-1010), and "Biological Scientists" (19-1020). The broad occupation Biological Scientists includes detailed occupations such as "Biochemists and Biophysicists" (19-1021), and "Microbiologists" (19-1022).

This creates the vague understanding that these row duplicates are the same occupational fields surveyed, but the fields are categorized twice -- as both minor and broad groups. The purpose for this duplication is unknown.

According to the user guide, the minor group only has 96 instances across the whole database. We already have only 94 such groups, as shown previously.

We do know that when BLS adds duplicates, it does so because it has aggregated data from a more detailed into a more broad direction. The BLS has a track record of making things more general, not more detailed.

Let's continue in this direction by dropping the broad rows and keeping the more general minor rows.

Part 3: Dropping Additional Duplicates¶

In [132]:
counts_thresh = counts[counts > threshold].index
In [133]:
nat_data_filtered = nat_data_pre_drop[(nat_data_pre_drop['occ_title'].isin(counts_thresh)) & (nat_data_pre_drop['o_group'] == 'broad')]
In [134]:
nat_data_filtered
Out[134]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
305 99 U.S. 1 US 0 Cross-industry cross-industry 1235 19-5010 Occupational Health and Safety Specialists and... ... 37.93 48.79 61.13 47790 59620 78900 101490 127140 NaN NaN
682 99 U.S. 1 US 0 Cross-industry cross-industry 1235 35-1010 Supervisors of Food Preparation and Serving Wo... ... 21.22 27.4 34.58 29600 35840 44140 56990 71920 NaN NaN
726 99 U.S. 1 US 0 Cross-industry cross-industry 1235 37-3010 Grounds Maintenance Workers ... 18.5 22.3 27.14 30140 35470 38470 46390 56460 NaN NaN
775 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-6010 Baggage Porters, Bellhops, and Concierges ... 17.7 20.96 26.13 28360 32560 36810 43600 54340 NaN NaN
779 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-7010 Tour and Travel Guides ... 17.63 22.07 28.81 26890 31250 36660 45910 59930 NaN NaN
817 99 U.S. 1 US 0 Cross-industry cross-industry 1235 41-4010 Sales Representatives, Wholesale and Manufactu... ... 35.63 49.61 74.97 38910 50820 74100 103200 155930 NaN NaN
918 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-6010 Secretaries and Administrative Assistants ... 22.82 28.84 36.81 33840 38790 47460 59990 76550 NaN NaN
1022 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-3010 Helpers, Construction Trades ... 19.44 22.94 27.64 31360 36280 40430 47710 57480 NaN NaN
1213 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-5110 Printing Workers ... 21.55 25.24 30.06 31450 36950 44830 52510 62530 NaN NaN

9 rows × 28 columns

In [135]:
nat_data_pre_drop_2 = nat_data_pre_drop.drop(nat_data_filtered.index)
In [136]:
nat_data_pre_drop_2
Out[136]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 99 U.S. 1 US 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 23.8 37.89 60.44 29990 36730 49500 78810 125720 NaN NaN
1 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 58.7 82.5 # 57010 79900 122090 171610 # NaN NaN
2 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1000 Top Executives ... 50.48 81.01 # 47510 68800 104990 168490 # NaN NaN
3 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1010 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
5 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1020 General and Operations Managers ... 49.5 78.91 # 47420 67160 102950 164130 # NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1396 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7073 Wellhead Pumpers ... 33.66 38.81 46.86 39110 54450 70010 80720 97470 NaN NaN
1397 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7080 Refuse and Recyclable Material Collectors ... 23.24 29.33 36.16 31810 38330 48350 61010 75200 NaN NaN
1399 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7120 Tank Car, Truck, and Ship Loaders ... 27.92 34.25 42.36 38260 47260 58070 71230 88120 NaN NaN
1401 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7190 Miscellaneous Material Moving Workers ... 20.04 24.45 31.18 33280 35470 41690 50850 64850 NaN NaN
1402 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-7199 Material Moving Workers, All Other ... 20.04 24.45 31.18 33280 35470 41690 50850 64850 NaN NaN

1138 rows × 28 columns

Before this data cleaning instance, we had 1396 rows. Our original number of rows that had duplicate data was 516.

In [137]:
(1396 - 1138) * 2
Out[137]:
516

We appear to have dropped the correct amount of rows for this attempt.

In [138]:
nat_data_pre_drop_2[nat_data_pre_drop_2.groupby('occ_title')['occ_title'].transform('count') > threshold]
Out[138]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly

0 rows × 28 columns

In [139]:
nat_data_pre_drop_2.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1138 entries, 0 to 1402
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   area         1138 non-null   int64  
 1   area_title   1138 non-null   object 
 2   area_type    1138 non-null   int64  
 3   prim_state   1138 non-null   object 
 4   naics        1138 non-null   int64  
 5   naics_title  1138 non-null   object 
 6   i_group      1138 non-null   object 
 7   own_code     1138 non-null   int64  
 8   occ_code     1138 non-null   object 
 9   occ_title    1138 non-null   object 
 10  o_group      1138 non-null   object 
 11  tot_emp      1138 non-null   int64  
 12  emp_prse     1138 non-null   float64
 13  h_mean       1138 non-null   object 
 14  a_mean       1138 non-null   object 
 15  mean_prse    1138 non-null   float64
 16  h_pct10      1138 non-null   object 
 17  h_pct25      1138 non-null   object 
 18  h_median     1138 non-null   object 
 19  h_pct75      1138 non-null   object 
 20  h_pct90      1138 non-null   object 
 21  a_pct10      1138 non-null   object 
 22  a_pct25      1138 non-null   object 
 23  a_median     1138 non-null   object 
 24  a_pct75      1138 non-null   object 
 25  a_pct90      1138 non-null   object 
 26  annual       80 non-null     object 
 27  hourly       7 non-null      object 
dtypes: float64(2), int64(5), object(21)
memory usage: 257.8+ KB
In [140]:
nat_data = nat_data_pre_drop_2

Remove To Do Item¶

  • Remove duplicate rows discovered via the occ_title column.

o_group: SOC Column Cleaning Continued¶

In [141]:
nat_data['o_group'].unique()
Out[141]:
array(['total', 'major', 'minor', 'broad', 'detailed'], dtype=object)
In [142]:
nat_data['o_group'].value_counts()
Out[142]:
o_group
detailed    575
broad       446
minor        94
major        22
total         1
Name: count, dtype: int64

The total value in the o_group column is new to this analysis.

In [143]:
print(nat_desc.loc[nat_desc['Field'] == 'o_group', 'Field Description'].values[0])
SOC occupation level. For most occupations, this field indicates the standard SOC major, minor, broad, and detailed levels, in addition to all-occupations totals. For occupations that OEWS no longer publishes at the SOC detailed level, the “detailed” designation indicates the most detailed data available: either a standard SOC broad occupation or an OEWS-specific combination of detailed occupations. Occupations that OEWS has aggregated to the SOC broad occupation level will appear in the file twice, once with the “broad” and once with the “detailed” designation.

We seem to have completed the process of cleaning this column at this time. However, as we delve deeper into the data, perhaps more revelations may shed light on data quality.

'tot_emp' Wrangle¶

In [144]:
nat_data['tot_emp'].duplicated().sum()
Out[144]:
np.int64(68)

We seem to have 68 rows with duplicated values in tot_emp.

In [145]:
with pd.option_context('display.max_columns', None):
    print(nat_data[nat_data.groupby('tot_emp')['tot_emp'].transform('count') > threshold].sort_values(by='tot_emp', ascending=False))
      area area_title  area_type prim_state  naics     naics_title  \
835     99       U.S.          1         US      0  Cross-industry   
836     99       U.S.          1         US      0  Cross-industry   
793     99       U.S.          1         US      0  Cross-industry   
792     99       U.S.          1         US      0  Cross-industry   
815     99       U.S.          1         US      0  Cross-industry   
...    ...        ...        ...        ...    ...             ...   
1005    99       U.S.          1         US      0  Cross-industry   
844     99       U.S.          1         US      0  Cross-industry   
843     99       U.S.          1         US      0  Cross-industry   
566     99       U.S.          1         US      0  Cross-industry   
269     99       U.S.          1         US      0  Cross-industry   

             i_group  own_code occ_code  \
835   cross-industry      1235  43-1000   
836   cross-industry      1235  43-1010   
793   cross-industry      1235  41-1010   
792   cross-industry      1235  41-1000   
815   cross-industry      1235  41-3091   
...              ...       ...      ...   
1005  cross-industry      1235  47-2142   
844   cross-industry      1235  43-2099   
843   cross-industry      1235  43-2090   
566   cross-industry      1235  29-1243   
269   cross-industry      1235  19-3032   

                                              occ_title   o_group  tot_emp  \
835   Supervisors of Office and Administrative Suppo...     minor  1495580   
836   First-Line Supervisors of Office and Administr...     broad  1495580   
793             First-Line Supervisors of Sales Workers     broad  1332180   
792                        Supervisors of Sales Workers     minor  1332180   
815   Sales Representatives of Services, Except Adve...  detailed  1189330   
...                                                 ...       ...      ...   
1005                                       Paperhangers  detailed     1520   
844       Communications Equipment Operators, All Other  detailed     1390   
843    Miscellaneous Communications Equipment Operators     broad     1390   
566                                  Pediatric Surgeons  detailed     1050   
269             Industrial-Organizational Psychologists  detailed     1050   

      emp_prse  h_mean  a_mean  mean_prse h_pct10 h_pct25 h_median h_pct75  \
835        0.4    34.4   71560        0.2   21.12   25.57     31.8   39.59   
836        0.4    34.4   71560        0.2   21.12   25.57     31.8   39.59   
793        0.4   28.77   59830        0.3   15.69   18.61    23.81   32.19   
792        0.4   28.77   59830        0.3   15.69   18.61    23.81   32.19   
815        0.7   39.07   81260        0.5   17.76   22.65    31.86   47.49   
...        ...     ...     ...        ...     ...     ...      ...     ...   
1005      13.7   24.95   51900        3.9   16.84   19.68     23.2   28.11   
844        9.9   26.87   55890        2.1    17.3   21.45       24   30.92   
843        9.9   26.87   55890        2.1    17.3   21.45       24   30.92   
566       23.4  216.74  450810        5.4   91.21       #        #       #   
269       24.7   64.62  134400       11.6   24.94   38.84    52.81   95.27   

     h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly  
835    49.51   43920   53190    66140   82340  102980    NaN    NaN  
836    49.51   43920   53190    66140   82340  102980    NaN    NaN  
793    45.76   32630   38710    49510   66960   95170    NaN    NaN  
792    45.76   32630   38710    49510   66960   95170    NaN    NaN  
815    68.29   36930   47120    66260   98780  142040    NaN    NaN  
...      ...     ...     ...      ...     ...     ...    ...    ...  
1005    33.4   35020   40930    48260   58470   69470    NaN    NaN  
844    42.84   35980   44620    49910   64310   89110    NaN    NaN  
843    42.84   35980   44620    49910   64310   89110    NaN    NaN  
566        #  189720       #        #       #       #    NaN    NaN  
269   107.97   51880   80790   109840  198170  224590    NaN    NaN  

[130 rows x 28 columns]
Observations¶

There seems to be two types of duplicates here. Some appear to be the same jobs with the same reported metrics, again with differences in the o_group column.

However, others appear to be from completely different industries, with different metrics. Therefore, sometiems, different field simply have the same number of total workers.

There are 130 rows, which is quite a few for a manual inspection.

(We also note that the 130 rows is different than the 68 duplicates reported previously. This may mean that some of these duplicates in tot_emp are happening at intervals higher than 2.)

We know from the SOC User Manual that the occ_code column gives us information about job categorization.

One way to go forward could be to group duplicates by those that have similar starting digits in occ_code, and then those that do not can be ignored.

Add To Do Item¶

  • Isolate duplicates in tot_emp by occ_code
  • Remove duplicates

Isolating by occ_code¶

Our first task is to create a filtering dataframe to capture the rows that will be removed from the main dataframe in the end

In [146]:
nat_data_filtered = nat_data[nat_data.groupby('tot_emp')['tot_emp'].transform('count') > threshold].sort_values(by='tot_emp', ascending=False)
In [147]:
nat_data_filtered
Out[147]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
835 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-1000 Supervisors of Office and Administrative Suppo... ... 31.8 39.59 49.51 43920 53190 66140 82340 102980 NaN NaN
836 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-1010 First-Line Supervisors of Office and Administr... ... 31.8 39.59 49.51 43920 53190 66140 82340 102980 NaN NaN
793 99 U.S. 1 US 0 Cross-industry cross-industry 1235 41-1010 First-Line Supervisors of Sales Workers ... 23.81 32.19 45.76 32630 38710 49510 66960 95170 NaN NaN
792 99 U.S. 1 US 0 Cross-industry cross-industry 1235 41-1000 Supervisors of Sales Workers ... 23.81 32.19 45.76 32630 38710 49510 66960 95170 NaN NaN
815 99 U.S. 1 US 0 Cross-industry cross-industry 1235 41-3091 Sales Representatives of Services, Except Adve... ... 31.86 47.49 68.29 36930 47120 66260 98780 142040 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1005 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-2142 Paperhangers ... 23.2 28.11 33.4 35020 40930 48260 58470 69470 NaN NaN
844 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-2099 Communications Equipment Operators, All Other ... 24 30.92 42.84 35980 44620 49910 64310 89110 NaN NaN
843 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-2090 Miscellaneous Communications Equipment Operators ... 24 30.92 42.84 35980 44620 49910 64310 89110 NaN NaN
566 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-1243 Pediatric Surgeons ... # # # 189720 # # # # NaN NaN
269 99 U.S. 1 US 0 Cross-industry cross-industry 1235 19-3032 Industrial-Organizational Psychologists ... 52.81 95.27 107.97 51880 80790 109840 198170 224590 NaN NaN

130 rows × 28 columns

Let's remove columns that we don't need for this portion of the wrangling process.

In [148]:
col_lim_tot_emp = [
    'area',
    'area_title', 
    'area_type',
    'prim_state',
    'naics',
    'naics_title',
    'i_group',
    'own_code',
    'mean_prse',
    'h_pct10',
    'h_pct25',
    'h_median',
    'h_pct75', 
    'h_pct90',
    'a_pct10',
    'a_pct25',
    'a_median',
    'a_pct75', 
    'a_pct90',
    'annual',
    'hourly'
]
In [149]:
nat_data_filtered = nat_data_filtered.drop(columns=col_lim_tot_emp)
In [150]:
nat_data_filtered
Out[150]:
occ_code occ_title o_group tot_emp emp_prse h_mean a_mean
835 43-1000 Supervisors of Office and Administrative Suppo... minor 1495580 0.4 34.4 71560
836 43-1010 First-Line Supervisors of Office and Administr... broad 1495580 0.4 34.4 71560
793 41-1010 First-Line Supervisors of Sales Workers broad 1332180 0.4 28.77 59830
792 41-1000 Supervisors of Sales Workers minor 1332180 0.4 28.77 59830
815 41-3091 Sales Representatives of Services, Except Adve... detailed 1189330 0.7 39.07 81260
... ... ... ... ... ... ... ...
1005 47-2142 Paperhangers detailed 1520 13.7 24.95 51900
844 43-2099 Communications Equipment Operators, All Other detailed 1390 9.9 26.87 55890
843 43-2090 Miscellaneous Communications Equipment Operators broad 1390 9.9 26.87 55890
566 29-1243 Pediatric Surgeons detailed 1050 23.4 216.74 450810
269 19-3032 Industrial-Organizational Psychologists detailed 1050 24.7 64.62 134400

130 rows × 7 columns

Compare by occ_code Starting Digits¶

Looking at the SOC User Guide we can see details about the occ_code string meanings.

Occupational Coding

Each item in the hierarchy is designated by a six-digit code. The hyphen between the second and third digit is used only for presentation clarity. The first two digits of the SOC code represent the major group; the third digit represents the minor group; the fourth and fifth digits represent the broad occupation; and the detailed occupation is represented by the sixth digit. Major group codes end with 0000 (e.g., 33-0000, Protective Service Occupations), minor groups end with 000 (e.g., 33-2000, Fire Fighting Workers), and broad occupations end with 0 (e.g., 33-2020, Fire Inspectors). All residuals ("Other," "Miscellaneous," or "All Other"), whether at the detailed or broad occupation or minor group level, contain a 9 at the level of the residual. Detailed residual occupations end in 9 (e.g., 33-9199, Protective Service Workers, All Other); broad occupations which are minor group residuals end in 90 (e.g., 33-9190, Miscellaneous Protective Service Workers); and minor groups which are major group residuals end in 9000 (e.g., 33-9000, Other Protective Service Workers): 33-0000 Protective Service Occupations

33-9000 Other Protective Service Workers

33-9190 Miscellaneous Protective Service Workers

33-9199 Protective Service Workers, All Other

By setting the string depth to 5, we can narrow down our results to only those which have the same number in the first digit of the broad category.

In [151]:
str_depth = 5
In [152]:
# From the filtered dataset of duplicates, capture the count of the broad-level categories
# Return the series as a dataframe
first_str = pd.DataFrame(nat_data_filtered['occ_code'].str[:str_depth].value_counts())
In [153]:
with pd.option_context('display.max_rows', None):
    print(first_str)
          count
occ_code       
27-30         4
51-40         4
49-90         4
17-21         4
19-20         3
27-20         3
19-10         3
47-50         3
25-30         3
43-91         3
27-40         3
51-70         3
15-12         3
19-40         3
11-91         3
19-30         3
29-12         3
37-10         3
45-10         2
43-30         2
29-10         2
43-10         2
53-71         2
53-30         2
51-91         2
21-20         2
53-60         2
27-10         2
15-20         2
53-40         2
17-30         2
43-20         2
41-10         2
47-10         2
39-90         2
53-10         2
13-20         2
51-10         2
25-90         2
49-10         2
35-90         2
51-41         2
13-11         2
41-30         2
43-41         2
51-80         1
33-90         1
39-40         1
47-21         1
51-20         1
51-30         1
47-20         1
25-20         1
11-30         1
17-20         1
35-20         1
53-50         1
51-90         1
29-90         1
21-10         1
17-10         1
41-90         1
11-90         1

Observe tot_emp Duplicates with Only One occ_code Instance Each¶

Those that have a count of 1 are unlikely to be duplicates. Let's review those first.

In [154]:
# From the list of rows of the broad-level occ_code's
# capture those that only appear once
# Return the index of the resulting series as a dataframe 
first_str_lim = pd.DataFrame(first_str[first_str['count'] == threshold].index)
In [155]:
first_str_lim.sort_values(by='occ_code', ascending=False)
Out[155]:
occ_code
11 53-50
12 51-90
0 51-80
5 51-30
4 51-20
3 47-21
6 47-20
16 41-90
2 39-40
10 35-20
1 33-90
13 29-90
7 25-20
14 21-10
9 17-20
15 17-10
17 11-90
8 11-30
In [156]:
# Create a dataframe to hold a list of the appropriate length broad_level rows
broad_level_list = pd.DataFrame(nat_data_filtered['occ_code'].str[:str_depth])
In [157]:
# Re-filter nat_data_filtered to capture only those rows where the broad-level code appears once 
broad_level_list_one = nat_data_filtered[broad_level_list['occ_code'].isin(first_str_lim['occ_code'])].sort_values(by='occ_code')
In [158]:
broad_level_list_one
Out[158]:
occ_code occ_title o_group tot_emp emp_prse h_mean a_mean
21 11-3013 Facilities Managers detailed 141090 0.7 55.06 114520
47 11-9039 Education Administrators, All Other detailed 53330 1.9 47.82 99460
173 17-1020 Surveyors, Cartographers, and Photogrammetrists broad 65870 2.4 37.46 77920
183 17-2040 Chemical Engineers broad 20330 3.7 61.75 128430
312 21-1013 Marriage and Family Therapists detailed 65870 4.0 34.96 72720
406 25-2023 Career/Technical Education Teachers, Middle Sc... detailed 14200 3.6 * 68690
609 29-9099 Healthcare Practitioners and Technical Workers... detailed 36970 1.9 35.19 73200
672 33-9031 Gambling Surveillance Officers and Gambling In... detailed 10000 2.1 22.46 46710
692 35-2019 Cooks, All Other detailed 23590 7.4 18.27 38000
760 39-4012 Crematory Operators detailed 2950 5.7 21.53 44790
822 41-9011 Demonstrators and Product Promoters detailed 64770 5.4 21.03 43730
995 47-2082 Tapers detailed 12500 7.6 33.84 70390
1005 47-2142 Paperhangers detailed 1520 13.7 24.95 51900
1153 51-2021 Coil Winders, Tapers, and Finishers detailed 12170 4.0 23.53 48940
1170 51-3022 Meat, Poultry, and Fish Cutters and Trimmers detailed 141090 1.2 18.58 38640
1255 51-8011 Nuclear Power Reactor Operators detailed 5720 1.0 59.05 122830
1272 51-9021 Crushing, Grinding, and Polishing Machine Sett... detailed 28550 3.3 23.75 49400
1352 53-5000 Water Transportation Workers minor 77710 2.4 38.54 80150
In [159]:
broad_level_list_one.shape
Out[159]:
(18, 7)
Analysis¶

The above results show that we have isolated the rows that have:

  • duplicated tot_emp values
  • only one broad-level occ_code instance

Update To Do Items¶

  • Isolate duplicates in tot_emp by occ_code
  • Analyze duplicates
  • Remove duplicates

Compare Subgroups to Each Other¶

Let's check and see which rows in the overall 130 nat_data_filtered dataset have tot_emp values that match the above set.

In [160]:
# Look in the first and second filtered df's to see where the 'tot_emp' values match
nat_data_filtered[nat_data_filtered['tot_emp'].isin(broad_level_list_one['tot_emp'])].sort_values(by='tot_emp', ascending=False)
Out[160]:
occ_code occ_title o_group tot_emp emp_prse h_mean a_mean
1170 51-3022 Meat, Poultry, and Fish Cutters and Trimmers detailed 141090 1.2 18.58 38640
21 11-3013 Facilities Managers detailed 141090 0.7 55.06 114520
1352 53-5000 Water Transportation Workers minor 77710 2.4 38.54 80150
1206 51-4190 Miscellaneous Metal Workers and Plastic Workers broad 77710 1.8 22.84 47500
312 21-1013 Marriage and Family Therapists detailed 65870 4.0 34.96 72720
173 17-1020 Surveyors, Cartographers, and Photogrammetrists broad 65870 2.4 37.46 77920
822 41-9011 Demonstrators and Product Promoters detailed 64770 5.4 21.03 43730
145 15-1243 Database Architects detailed 64770 2.7 68.57 142620
503 27-4030 Television, Video, and Film Camera Operators a... broad 53330 3.8 39.35 81850
47 11-9039 Education Administrators, All Other detailed 53330 1.9 47.82 99460
282 19-3099 Social Scientists and Related Workers, All Other detailed 36970 1.8 51.17 106440
609 29-9099 Healthcare Practitioners and Technical Workers... detailed 36970 1.9 35.19 73200
1272 51-9021 Crushing, Grinding, and Polishing Machine Sett... detailed 28550 3.3 23.75 49400
284 19-4010 Agricultural and Food Science Technicians broad 28550 3.0 25.01 52030
494 27-3099 Media and Communication Workers, All Other detailed 23590 7.7 40.81 84870
692 35-2019 Cooks, All Other detailed 23590 7.4 18.27 38000
183 17-2040 Chemical Engineers broad 20330 3.7 61.75 128430
1191 51-4051 Metal-Refining Furnace Operators and Tenders detailed 20330 2.4 27.55 57290
286 19-4013 Food Science Technicians detailed 14200 3.0 26.15 54400
406 25-2023 Career/Technical Education Teachers, Middle Sc... detailed 14200 3.6 * 68690
222 17-3025 Environmental Engineering Technologists and Te... detailed 12500 6.4 30.32 63070
995 47-2082 Tapers detailed 12500 7.6 33.84 70390
1303 51-9191 Adhesive Bonding Machine Operators and Tenders detailed 12170 3.4 21.96 45670
1153 51-2021 Coil Winders, Tapers, and Finishers detailed 12170 4.0 23.53 48940
333 21-2099 Religious Workers, All Other detailed 12170 9.0 23.96 49830
332 21-2090 Miscellaneous Religious Workers broad 12170 9.0 23.96 49830
451 27-1013 Fine Artists, Including Painters, Sculptors, a... detailed 10000 7.7 36.76 76450
672 33-9031 Gambling Surveillance Officers and Gambling In... detailed 10000 2.1 22.46 46710
260 19-2043 Hydrologists detailed 5720 3.9 47.18 98130
1255 51-8011 Nuclear Power Reactor Operators detailed 5720 1.0 59.05 122830
273 19-3040 Sociologists broad 2950 7.2 53.69 111670
760 39-4012 Crematory Operators detailed 2950 5.7 21.53 44790
1350 53-4090 Miscellaneous Rail Transportation Workers broad 1520 10.3 25.72 53500
1351 53-4099 Rail Transportation Workers, All Other detailed 1520 10.3 25.72 53500
1005 47-2142 Paperhangers detailed 1520 13.7 24.95 51900
In [161]:
nat_data_filtered[nat_data_filtered['tot_emp'].isin(broad_level_list_one['tot_emp'])].shape
Out[161]:
(35, 7)

Of the 35 rows, most appear to not be duplicates.

However, a visual scan shows some issues. Note that Miscellaneous Rail Transportation Workers and Rail Transportation Workers, All Other are duplicates, as are religious workers.

Compare Against emp_prse Instead¶

We see when we look at these we see that, in both instances, the associated emp_prse are all equal. (This is also true for other columns, such as h_mean and a_mean.)

We can use this to further limit our duplicates at this point:

  • within a grouping by tot_emp duplicates, if the values in emp_prse do not also match, they're not duplicates.

The filtered dataset is supposed to be a list of duplicates, therefore these non-duplicate rows should be dropped at this stage.

Update To Do Items¶

  • Isolate duplicates in tot_emp by occ_code
  • Isolate duplicates in tot_emp by emp_prse
  • Analyze duplicates
  • Remove duplicates

Create a Custom Function¶

In [162]:
# Create temporary drop and count markers for the function
count_marker = 'ct'
drop_marker = 'drop'
In [163]:
# Add the temporary columns to the filtered dataset

nat_data_filtered[count_marker] = 0
nat_data_filtered[drop_marker] = False
In [164]:
# Checks a df first by initial grouping, and then subgrouping to discern duplicates
# Updates provided df in place
# Requires initial dataframe have columns for marking counting within subgroups and a drop marker
def check_unique_within_subgroup(
    df,
    fir_grp,
    sec_grp,
    threshold,
    count_marker,
    drop_marker
):    
    # Iterate through the first group and separate by fir_grp
    for grp_num, group_df in df.groupby(fir_grp):
        
        # Debug:
        # print(f"Checking group: {grp_num}")

        # Iterate through sec_grp, count values, and set count of each to count marker column
        group_df[count_marker] = group_df.groupby(sec_grp)[sec_grp].transform('count')
        
        # In the drop_marker col, indicate whether this passes the threshold
        group_df[drop_marker] = group_df[count_marker] == threshold
        # print(group_df)

        # For loop to iterate through each value in the subgroup
        # and, in the parent df, set the associated count_marker and drop_marker
        for grp_idx, grp_row in group_df.iterrows():

            # Debug:
            # print(f"\ngrp_idx: {grp_idx},\ncount_marker: {grp_row[count_marker]},\ndrop_marker: {grp_row[drop_marker]}")
            
            # Debug:
            # print(f"\ndf tot_emp: {df.loc[grp_idx, fir_grp]}\ndf curr count: {df.loc[grp_idx, count_marker]}")

            # Set the count_marker in the parent df
            df.loc[grp_idx, count_marker] = grp_row[count_marker]

            # Set the drop_marker in the parent df
            df.loc[grp_idx, drop_marker] = grp_row[drop_marker]

Call the Custom Function¶

In [165]:
# Call the custom function on nat_data_filtered
check_unique_within_subgroup(nat_data_filtered, 'tot_emp', 'emp_prse', threshold, count_marker, drop_marker) 
In [166]:
# Check result
with pd.option_context('display.max_rows', None, 'display.max_colwidth', 32):
    print(nat_data_filtered[['occ_title', 'emp_prse', 'tot_emp', count_marker, drop_marker, 'a_mean']])
                            occ_title  emp_prse  tot_emp  ct   drop  a_mean
835   Supervisors of Office and Ad...       0.4  1495580   2  False   71560
836   First-Line Supervisors of Of...       0.4  1495580   2  False   71560
793   First-Line Supervisors of Sa...       0.4  1332180   2  False   59830
792      Supervisors of Sales Workers       0.4  1332180   2  False   59830
815   Sales Representatives of Ser...       0.7  1189330   2  False   81260
814   Miscellaneous Sales Represen...       0.7  1189330   2  False   81260
106   Miscellaneous Business Opera...       0.6  1128200   2  False   92380
107   Business Operations Speciali...       0.6  1128200   2  False   92380
968   Supervisors of Construction ...       0.6   806080   2  False   84500
969   First-Line Supervisors of Co...       0.6   806080   2  False   84500
1147  First-Line Supervisors of Pr...       0.3   685140   2  False   74540
1146  Supervisors of Production Wo...       0.3   685140   2  False   74540
73                Managers, All Other       0.5   630980   2  False  149890
72             Miscellaneous Managers       0.5   630980   2  False  149890
1313  Supervisors of Transportatio...       0.4   615680   2  False   65680
1314  First-Line Supervisors of Tr...       0.4   615680   2  False   65680
1072  First-Line Supervisors of Me...       0.5   600680   2  False   82930
1071  Supervisors of Installation,...       0.5   600680   2  False   82930
154   Computer Occupations, All Other       1.0   439380   2  False  116700
153   Miscellaneous Computer Occup...       1.0   439380   2  False  116700
714   Supervisors of Building and ...       1.4   298790   2  False   54580
1188                       Machinists       1.0   298790   1   True   57390
715   First-Line Supervisors of Bu...       1.4   298790   2  False   54580
941   Miscellaneous Office and Adm...       1.5   195890   2  False   48700
942   Office and Administrative Su...       1.5   195890   2  False   48700
423                            Tutors       1.8   174660   1   True   47780
716   First-Line Supervisors of Ho...       1.4   174660   1   True   51170
210              Engineers, All Other       1.2   150750   2  False  121720
209           Miscellaneous Engineers       1.2   150750   2  False  121720
896   Information and Record Clerk...       0.9   143910   2  False   50200
895   Miscellaneous Information an...       0.9   143910   2  False   50200
1170  Meat, Poultry, and Fish Cutt...       1.2   141090   1   True   38640
21                Facilities Managers       0.7   141090   1   True  114520
131   Financial Specialists, All O...       1.3   127450   2  False   93890
130   Miscellaneous Financial Spec...       1.3   127450   2  False   93890
425   Miscellaneous Teachers and I...       4.1   125010   2  False   71610
426   Teachers and Instructors, Al...       4.1   125010   2  False   71610
445   Educational Instruction and ...       1.1   114640   2  False   55700
444   Miscellaneous Educational In...       1.1   114640   2  False   55700
491   Miscellaneous Media and Comm...       2.8    89580   1   True   71560
711   Miscellaneous Food Preparati...       5.2    89580   2  False   35990
712   Food Preparation and Serving...       5.2    89580   2  False   35990
1241  Cabinetmakers and Bench Carp...       1.7    79540   1   True   47460
1126  Precision Instrument and Equ...       2.2    79540   1   True   65560
1352     Water Transportation Workers       2.4    77710   1   True   80150
1206  Miscellaneous Metal Workers ...       1.8    77710   1   True   47500
312    Marriage and Family Therapists       4.0    65870   1   True   72720
173   Surveyors, Cartographers, an...       2.4    65870   1   True   77920
822   Demonstrators and Product Pr...       5.4    64770   1   True   43730
145               Database Architects       2.7    64770   1   True  142620
789   Miscellaneous Personal Care ...       4.6    62390   2  False   37990
790   Personal Care and Service Wo...       4.6    62390   2  False   37990
503   Television, Video, and Film ...       3.8    53330   1   True   81850
47    Education Administrators, Al...       1.9    53330   1   True   99460
1339  Motor Vehicle Operators, All...       3.1    50330   2  False   40980
1338  Miscellaneous Motor Vehicle ...       3.1    50330   2  False   40980
520                      Optometrists       2.7    41890   1   True  140940
550                 Anesthesiologists       7.9    41890   1   True  336640
483   News Analysts, Reporters and...       2.2    41550   2  False  106030
484   News Analysts, Reporters, an...       2.2    41550   2  False  106030
860    Miscellaneous Financial Clerks       3.0    37030   2  False   55330
861       Financial Clerks, All Other       3.0    37030   2  False   55330
282   Social Scientists and Relate...       1.8    36970   1   True  106440
609   Healthcare Practitioners and...       1.9    36970   1   True   73200
944   Supervisors of Farming, Fish...       2.4    29530   2  False   63360
945   First-Line Supervisors of Fa...       2.4    29530   2  False   63360
1272  Crushing, Grinding, and Poli...       3.3    28550   1   True   49400
284   Agricultural and Food Scienc...       3.0    28550   1   True   52030
1401  Miscellaneous Material Movin...       4.4    25190   2  False   46650
1402  Material Moving Workers, All...       4.4    25190   2  False   46650
494   Media and Communication Work...       7.7    23590   1   True   84870
692                  Cooks, All Other       7.4    23590   1   True   38000
262    Physical Scientists, All Other       2.4    22580   2  False  123070
261   Miscellaneous Physical Scien...       2.4    22580   2  False  123070
1191  Metal-Refining Furnace Opera...       2.4    20330   1   True   57290
183                Chemical Engineers       3.7    20330   1   True  128430
1186  Lathe and Turning Machine To...       3.4    18970   1   True   50190
207               Petroleum Engineers       5.8    18970   1   True  153560
468   Athletes and Sports Competitors       7.3    14370   1   True  259750
233   Food Scientists and Technolo...       5.9    14370   1   True   92190
1058  Continuous Mining Machine Op...       3.6    14340   1   True   63920
285          Agricultural Technicians       4.9    14340   1   True   49680
406   Career/Technical Education T...       3.6    14200   1   True   68690
286          Food Science Technicians       3.0    14200   1   True   54400
61    Postmasters and Mail Superin...       0.0    13810   1   True   93760
1187  Milling and Planing Machine ...       4.8    13810   1   True   52940
507   Media and Communication Equi...       9.9    13020   2  False   78350
506   Miscellaneous Media and Comm...       9.9    13020   2  False   78350
222   Environmental Engineering Te...       6.4    12500   1   True   63070
995                            Tapers       7.6    12500   1   True   70390
1303  Adhesive Bonding Machine Ope...       3.4    12170   1   True   45670
1153  Coil Winders, Tapers, and Fi...       4.0    12170   1   True   48940
333      Religious Workers, All Other       9.0    12170   2  False   49830
332   Miscellaneous Religious Workers       9.0    12170   2  False   49830
1374  Miscellaneous Transportation...       4.7    10960   2  False   45040
1375  Transportation Workers, All ...       4.7    10960   2  False   45040
672   Gambling Surveillance Office...       2.1    10000   1   True   46710
451   Fine Artists, Including Pain...       7.7    10000   1   True   76450
462              Designers, All Other       9.7     9680   1   True   78000
1131  Precision Instrument and Equ...       4.9     9680   1   True   70420
472                           Dancers       9.6     9060   1   True       *
218   Aerospace Engineering and Op...       4.3     9060   1   True   86330
197   Marine Engineers and Naval A...       5.2     8440   1   True  116680
569                    Acupuncturists      19.5     8440   1   True   89750
247        Life Scientists, All Other       5.2     7320   2  False  101940
246     Miscellaneous Life Scientists       5.2     7320   2  False  101940
1251        Miscellaneous Woodworkers       6.5     6590   2  False   43730
1252           Woodworkers, All Other       6.5     6590   2  False   43730
1305  Cooling and Freezing Equipme...       3.5     6590   1   True   44290
1068  Miscellaneous Extraction Wor...       8.5     6070   2  False   54780
1069    Extraction Workers, All Other       8.5     6070   2  False   54780
517   Dentists, All Other Specialists      10.7     5900   1   True  246530
939            Statistical Assistants       5.7     5900   1   True   55470
1129  Musical Instrument Repairers...       5.2     5730   1   True   49020
1210  Tool Grinders, Filers, and S...       4.8     5730   1   True   51790
260                      Hydrologists       3.9     5720   1   True   98130
1255  Nuclear Power Reactor Operators       1.0     5720   1   True  122830
166   Miscellaneous Mathematical S...       6.5     4660   2  False   84700
167   Mathematical Science Occupat...       6.5     4660   2  False   84700
473                    Choreographers       9.4     3430   1   True   62020
1138                Commercial Divers      12.6     3430   1   True   78110
273                      Sociologists       7.2     2950   1   True  111670
760               Crematory Operators       5.7     2950   1   True   44790
1350  Miscellaneous Rail Transport...      10.3     1520   2  False   53500
1351  Rail Transportation Workers,...      10.3     1520   2  False   53500
1005                     Paperhangers      13.7     1520   1   True   51900
844   Communications Equipment Ope...       9.9     1390   2  False   55890
843   Miscellaneous Communications...       9.9     1390   2  False   55890
566                Pediatric Surgeons      23.4     1050   1   True  450810
269   Industrial-Organizational Ps...      24.7     1050   1   True  134400

Verification¶

A visual inspection shows that the logic appears to be functioning properly.

Duplicates are found for each row that has:

  • The same tot_emp value
  • The same emp_prse value

We can see that the a_mean values also match.

We are prepared to remove from the filtering dataset all values that have a count of 1, because they are not duplicates.

Prepare for Drop¶

Check that all values that return True in the drop_marker column are not duplicates.

We want to remove these from the filtered dataset, so that they will stay when we later remove the filtered dataset from the main df.

We want to check that when we return drop_marker == True values, either:

  • The tot_emp value prints only once
  • Or, if the tot_emp is a duplicate, the associated emp_prse values do not match
In [167]:
with pd.option_context('display.max_rows',None):
    print(nat_data_filtered[nat_data_filtered[drop_marker] == True].sort_values(by='tot_emp',ascending=False))
     occ_code                                          occ_title   o_group  \
1188  51-4040                                         Machinists     broad   
423   25-3040                                             Tutors     broad   
716   37-1011  First-Line Supervisors of Housekeeping and Jan...  detailed   
1170  51-3022       Meat, Poultry, and Fish Cutters and Trimmers  detailed   
21    11-3013                                Facilities Managers  detailed   
491   27-3090      Miscellaneous Media and Communication Workers     broad   
1241  51-7010                 Cabinetmakers and Bench Carpenters     broad   
1126  49-9060       Precision Instrument and Equipment Repairers     broad   
1352  53-5000                       Water Transportation Workers     minor   
1206  51-4190    Miscellaneous Metal Workers and Plastic Workers     broad   
312   21-1013                     Marriage and Family Therapists  detailed   
173   17-1020    Surveyors, Cartographers, and Photogrammetrists     broad   
822   41-9011                Demonstrators and Product Promoters  detailed   
145   15-1243                                Database Architects  detailed   
47    11-9039                Education Administrators, All Other  detailed   
503   27-4030  Television, Video, and Film Camera Operators a...     broad   
520   29-1040                                       Optometrists     broad   
550   29-1211                                  Anesthesiologists  detailed   
282   19-3099   Social Scientists and Related Workers, All Other  detailed   
609   29-9099  Healthcare Practitioners and Technical Workers...  detailed   
1272  51-9021  Crushing, Grinding, and Polishing Machine Sett...  detailed   
284   19-4010          Agricultural and Food Science Technicians     broad   
494   27-3099         Media and Communication Workers, All Other  detailed   
692   35-2019                                   Cooks, All Other  detailed   
1191  51-4051       Metal-Refining Furnace Operators and Tenders  detailed   
183   17-2040                                 Chemical Engineers     broad   
1186  51-4034  Lathe and Turning Machine Tool Setters, Operat...  detailed   
207   17-2170                                Petroleum Engineers     broad   
468   27-2021                    Athletes and Sports Competitors  detailed   
233   19-1012                  Food Scientists and Technologists  detailed   
285   19-4012                           Agricultural Technicians  detailed   
1058  47-5041                Continuous Mining Machine Operators  detailed   
286   19-4013                           Food Science Technicians  detailed   
406   25-2023  Career/Technical Education Teachers, Middle Sc...  detailed   
61    11-9130               Postmasters and Mail Superintendents     broad   
1187  51-4035  Milling and Planing Machine Setters, Operators...  detailed   
222   17-3025  Environmental Engineering Technologists and Te...  detailed   
995   47-2082                                             Tapers  detailed   
1303  51-9191     Adhesive Bonding Machine Operators and Tenders  detailed   
1153  51-2021                Coil Winders, Tapers, and Finishers  detailed   
672   33-9031  Gambling Surveillance Officers and Gambling In...  detailed   
451   27-1013  Fine Artists, Including Painters, Sculptors, a...  detailed   
462   27-1029                               Designers, All Other  detailed   
1131  49-9069  Precision Instrument and Equipment Repairers, ...  detailed   
472   27-2031                                            Dancers  detailed   
218   17-3021  Aerospace Engineering and Operations Technolog...  detailed   
569   29-1291                                     Acupuncturists  detailed   
197   17-2120              Marine Engineers and Naval Architects     broad   
1305  51-9193  Cooling and Freezing Equipment Operators and T...  detailed   
517   29-1029                    Dentists, All Other Specialists  detailed   
939   43-9110                             Statistical Assistants     broad   
1129  49-9063            Musical Instrument Repairers and Tuners  detailed   
1210  51-4194              Tool Grinders, Filers, and Sharpeners  detailed   
260   19-2043                                       Hydrologists  detailed   
1255  51-8011                    Nuclear Power Reactor Operators  detailed   
473   27-2032                                     Choreographers  detailed   
1138  49-9092                                  Commercial Divers  detailed   
273   19-3040                                       Sociologists     broad   
760   39-4012                                Crematory Operators  detailed   
1005  47-2142                                       Paperhangers  detailed   
566   29-1243                                 Pediatric Surgeons  detailed   
269   19-3032            Industrial-Organizational Psychologists  detailed   

      tot_emp  emp_prse  h_mean  a_mean  ct  drop  
1188   298790       1.0   27.59   57390   1  True  
423    174660       1.8   22.97   47780   1  True  
716    174660       1.4    24.6   51170   1  True  
1170   141090       1.2   18.58   38640   1  True  
21     141090       0.7   55.06  114520   1  True  
491     89580       2.8    34.4   71560   1  True  
1241    79540       1.7   22.82   47460   1  True  
1126    79540       2.2   31.52   65560   1  True  
1352    77710       2.4   38.54   80150   1  True  
1206    77710       1.8   22.84   47500   1  True  
312     65870       4.0   34.96   72720   1  True  
173     65870       2.4   37.46   77920   1  True  
822     64770       5.4   21.03   43730   1  True  
145     64770       2.7   68.57  142620   1  True  
47      53330       1.9   47.82   99460   1  True  
503     53330       3.8   39.35   81850   1  True  
520     41890       2.7   67.76  140940   1  True  
550     41890       7.9  161.85  336640   1  True  
282     36970       1.8   51.17  106440   1  True  
609     36970       1.9   35.19   73200   1  True  
1272    28550       3.3   23.75   49400   1  True  
284     28550       3.0   25.01   52030   1  True  
494     23590       7.7   40.81   84870   1  True  
692     23590       7.4   18.27   38000   1  True  
1191    20330       2.4   27.55   57290   1  True  
183     20330       3.7   61.75  128430   1  True  
1186    18970       3.4   24.13   50190   1  True  
207     18970       5.8   73.83  153560   1  True  
468     14370       7.3       *  259750   1  True  
233     14370       5.9   44.32   92190   1  True  
285     14340       4.9   23.88   49680   1  True  
1058    14340       3.6   30.73   63920   1  True  
286     14200       3.0   26.15   54400   1  True  
406     14200       3.6       *   68690   1  True  
61      13810       0.0   45.08   93760   1  True  
1187    13810       4.8   25.45   52940   1  True  
222     12500       6.4   30.32   63070   1  True  
995     12500       7.6   33.84   70390   1  True  
1303    12170       3.4   21.96   45670   1  True  
1153    12170       4.0   23.53   48940   1  True  
672     10000       2.1   22.46   46710   1  True  
451     10000       7.7   36.76   76450   1  True  
462      9680       9.7    37.5   78000   1  True  
1131     9680       4.9   33.86   70420   1  True  
472      9060       9.6    30.3       *   1  True  
218      9060       4.3    41.5   86330   1  True  
569      8440      19.5   43.15   89750   1  True  
197      8440       5.2    56.1  116680   1  True  
1305     6590       3.5   21.29   44290   1  True  
517      5900      10.7  118.52  246530   1  True  
939      5900       5.7   26.67   55470   1  True  
1129     5730       5.2   23.57   49020   1  True  
1210     5730       4.8    24.9   51790   1  True  
260      5720       3.9   47.18   98130   1  True  
1255     5720       1.0   59.05  122830   1  True  
473      3430       9.4   29.82   62020   1  True  
1138     3430      12.6   37.55   78110   1  True  
273      2950       7.2   53.69  111670   1  True  
760      2950       5.7   21.53   44790   1  True  
1005     1520      13.7   24.95   51900   1  True  
566      1050      23.4  216.74  450810   1  True  
269      1050      24.7   64.62  134400   1  True  
In [168]:
nat_data_filtered[nat_data_filtered[drop_marker] == True].shape
Out[168]:
(62, 9)
In [169]:
nat_data_filtered[nat_data_filtered[drop_marker] == True]['tot_emp'].value_counts()
Out[169]:
tot_emp
14340     2
174660    2
2950      2
3430      2
5720      2
5730      2
5900      2
8440      2
9060      2
9680      2
10000     2
12170     2
12500     2
13810     2
14200     2
1050      2
14370     2
53330     2
141090    2
79540     2
77710     2
65870     2
18970     2
64770     2
41890     2
36970     2
28550     2
23590     2
20330     2
6590      1
89580     1
1520      1
298790    1
Name: count, dtype: int64

Note that there are 4 values that are not have matching pairs in this subset. (We could guess that they are part of a set of 3 or another odd number, where the other rows are considered duplicated.)

Therefore, while there are 62 rows that we want to keep in our final nat_data dataset, only 58 of the values from this current subset will be considered duplicated. We should expect to see 29 as the returned value from ...duplicated().sum() at the final stage.

Remove Non-Duplicates From Filtered Dataframe¶

In [170]:
nat_data_filtered = nat_data_filtered[nat_data_filtered[drop_marker] == False]
In [171]:
nat_data_filtered
Out[171]:
occ_code occ_title o_group tot_emp emp_prse h_mean a_mean ct drop
835 43-1000 Supervisors of Office and Administrative Suppo... minor 1495580 0.4 34.4 71560 2 False
836 43-1010 First-Line Supervisors of Office and Administr... broad 1495580 0.4 34.4 71560 2 False
793 41-1010 First-Line Supervisors of Sales Workers broad 1332180 0.4 28.77 59830 2 False
792 41-1000 Supervisors of Sales Workers minor 1332180 0.4 28.77 59830 2 False
815 41-3091 Sales Representatives of Services, Except Adve... detailed 1189330 0.7 39.07 81260 2 False
... ... ... ... ... ... ... ... ... ...
167 15-2099 Mathematical Science Occupations, All Other detailed 4660 6.5 40.72 84700 2 False
1350 53-4090 Miscellaneous Rail Transportation Workers broad 1520 10.3 25.72 53500 2 False
1351 53-4099 Rail Transportation Workers, All Other detailed 1520 10.3 25.72 53500 2 False
844 43-2099 Communications Equipment Operators, All Other detailed 1390 9.9 26.87 55890 2 False
843 43-2090 Miscellaneous Communications Equipment Operators broad 1390 9.9 26.87 55890 2 False

68 rows × 9 columns

Remove Rows in Filtered Dataframe from Main National Dataframe¶

In [172]:
nat_data_pre_drop = nat_data.drop(nat_data_filtered.index)
In [173]:
nat_data_pre_drop.head()
Out[173]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 99 U.S. 1 US 0 Cross-industry cross-industry 1235 00-0000 All Occupations ... 23.8 37.89 60.44 29990 36730 49500 78810 125720 NaN NaN
1 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 58.7 82.5 # 57010 79900 122090 171610 # NaN NaN
2 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1000 Top Executives ... 50.48 81.01 # 47510 68800 104990 168490 # NaN NaN
3 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1010 Chief Executives ... 99.24 # # 73710 126080 206420 # # NaN NaN
5 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1020 General and Operations Managers ... 49.5 78.91 # 47420 67160 102950 164130 # NaN NaN

5 rows × 28 columns

In [174]:
nat_data_pre_drop.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1070 entries, 0 to 1399
Data columns (total 28 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   area         1070 non-null   int64  
 1   area_title   1070 non-null   object 
 2   area_type    1070 non-null   int64  
 3   prim_state   1070 non-null   object 
 4   naics        1070 non-null   int64  
 5   naics_title  1070 non-null   object 
 6   i_group      1070 non-null   object 
 7   own_code     1070 non-null   int64  
 8   occ_code     1070 non-null   object 
 9   occ_title    1070 non-null   object 
 10  o_group      1070 non-null   object 
 11  tot_emp      1070 non-null   int64  
 12  emp_prse     1070 non-null   float64
 13  h_mean       1070 non-null   object 
 14  a_mean       1070 non-null   object 
 15  mean_prse    1070 non-null   float64
 16  h_pct10      1070 non-null   object 
 17  h_pct25      1070 non-null   object 
 18  h_median     1070 non-null   object 
 19  h_pct75      1070 non-null   object 
 20  h_pct90      1070 non-null   object 
 21  a_pct10      1070 non-null   object 
 22  a_pct25      1070 non-null   object 
 23  a_median     1070 non-null   object 
 24  a_pct75      1070 non-null   object 
 25  a_pct90      1070 non-null   object 
 26  annual       78 non-null     object 
 27  hourly       7 non-null      object 
dtypes: float64(2), int64(5), object(21)
memory usage: 242.4+ KB
In [175]:
nat_data_pre_drop.shape
Out[175]:
(1070, 28)

Previously, we had 1138 rows. We saw that nat_data_filtered resulted in 68 rows for deletion. Now, we have 1070 rows.

In [176]:
1138 - 68
Out[176]:
1070
In [177]:
nat_data_pre_drop['tot_emp'].duplicated().sum()
Out[177]:
np.int64(29)

There is our expected value of 29.

In [178]:
nat_data_pre_drop[nat_data_pre_drop['tot_emp'].duplicated() == True]
Out[178]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
312 99 U.S. 1 US 0 Cross-industry cross-industry 1235 21-1013 Marriage and Family Therapists ... 30.66 40.87 53.66 42610 48600 63780 85020 111610 NaN NaN
406 99 U.S. 1 US 0 Cross-industry cross-industry 1235 25-2023 Career/Technical Education Teachers, Middle Sc... ... * * * 47090 55920 63620 78270 98430 True NaN
468 99 U.S. 1 US 0 Cross-industry cross-industry 1235 27-2021 Athletes and Sports Competitors ... * * * 24960 36750 62360 130770 # True NaN
472 99 U.S. 1 US 0 Cross-industry cross-industry 1235 27-2031 Dancers ... 23.97 37.43 53.66 * * * * * NaN True
503 99 U.S. 1 US 0 Cross-industry cross-industry 1235 27-4030 Television, Video, and Film Camera Operators a... ... 33.93 49.19 64.83 37600 49420 70570 102310 134840 NaN NaN
550 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-1211 Anesthesiologists ... # # # 124450 186680 # # # NaN NaN
566 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-1243 Pediatric Surgeons ... # # # 189720 # # # # NaN NaN
569 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-1291 Acupuncturists ... 37.57 51.5 76.22 41840 54090 78140 107120 158540 NaN NaN
609 99 U.S. 1 US 0 Cross-industry cross-industry 1235 29-9099 Healthcare Practitioners and Technical Workers... ... 30.78 43.75 61.22 37220 45250 64030 91000 127340 NaN NaN
672 99 U.S. 1 US 0 Cross-industry cross-industry 1235 33-9031 Gambling Surveillance Officers and Gambling In... ... 21.11 24.33 29.98 34020 37410 43900 50610 62360 NaN NaN
692 99 U.S. 1 US 0 Cross-industry cross-industry 1235 35-2019 Cooks, All Other ... 17.41 20.3 23.85 26430 31200 36210 42230 49600 NaN NaN
716 99 U.S. 1 US 0 Cross-industry cross-industry 1235 37-1011 First-Line Supervisors of Housekeeping and Jan... ... 22.85 29 35.67 34390 38600 47520 60330 74190 NaN NaN
760 99 U.S. 1 US 0 Cross-industry cross-industry 1235 39-4012 Crematory Operators ... 20.62 23.94 28.97 31970 35890 42880 49800 60260 NaN NaN
822 99 U.S. 1 US 0 Cross-industry cross-industry 1235 41-9011 Demonstrators and Product Promoters ... 18.25 24.09 29 30910 33860 37960 50100 60320 NaN NaN
939 99 U.S. 1 US 0 Cross-industry cross-industry 1235 43-9110 Statistical Assistants ... 24.73 30.59 38.18 38050 45970 51440 63620 79410 NaN NaN
995 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-2082 Tapers ... 31.11 39.9 51.04 43450 50780 64700 82990 106160 NaN NaN
1058 99 U.S. 1 US 0 Cross-industry cross-industry 1235 47-5041 Continuous Mining Machine Operators ... 30.47 36.95 40.59 41450 50850 63380 76850 84420 NaN NaN
1131 99 U.S. 1 US 0 Cross-industry cross-industry 1235 49-9069 Precision Instrument and Equipment Repairers, ... ... 32.25 39.44 48.43 44530 52090 67080 82030 100730 NaN NaN
1138 99 U.S. 1 US 0 Cross-industry cross-industry 1235 49-9092 Commercial Divers ... 29.39 45.11 73.36 39130 49370 61130 93840 152580 NaN NaN
1170 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-3022 Meat, Poultry, and Fish Cutters and Trimmers ... 18.13 20.89 23.41 29200 34050 37700 43450 48680 NaN NaN
1186 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-4034 Lathe and Turning Machine Tool Setters, Operat... ... 23.38 27.93 31.47 36200 41180 48620 58100 65450 NaN NaN
1187 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-4035 Milling and Planing Machine Setters, Operators... ... 23.22 29.46 36.06 36810 41650 48310 61290 75000 NaN NaN
1191 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-4051 Metal-Refining Furnace Operators and Tenders ... 26.81 31.29 38.59 38760 46550 55770 65070 80280 NaN NaN
1210 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-4194 Tool Grinders, Filers, and Sharpeners ... 23.54 28.95 35.64 33770 39520 48970 60210 74120 NaN NaN
1241 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-7010 Cabinetmakers and Bench Carpenters ... 22.12 25.97 30.4 34700 38240 46020 54010 63220 NaN NaN
1255 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-8011 Nuclear Power Reactor Operators ... 58.95 63.23 73.41 99300 107170 122610 131520 152690 NaN NaN
1272 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-9021 Crushing, Grinding, and Polishing Machine Sett... ... 22.54 28.1 31.72 35380 38990 46890 58440 65980 NaN NaN
1303 99 U.S. 1 US 0 Cross-industry cross-industry 1235 51-9191 Adhesive Bonding Machine Operators and Tenders ... 21.74 25.03 29.05 31290 36420 45210 52060 60420 NaN NaN
1352 99 U.S. 1 US 0 Cross-industry cross-industry 1235 53-5000 Water Transportation Workers ... 31.97 48.63 66.96 36960 48100 66490 101140 139270 NaN NaN

29 rows × 28 columns

We see that our data is prepared to be shifted to the main df.

In [179]:
nat_data = nat_data_pre_drop

Remove To Do Items¶

  • Isolate duplicates in tot_emp by occ_code
  • Isolate duplicates in tot_emp by emp_prse
  • Analyze duplicates
  • Remove duplicates

emp_prse Wrangle¶

The emp_prse column is numerical.

We won't be able to tell much about duplicates, but we can simply check over the df.

Update To Do Items¶

  • Analyze and inspect emp_prse
In [180]:
list_unique = nat_data['emp_prse'].unique()
In [181]:
temp_df = pd.DataFrame()
temp_df['emp_prse_count'] = pd.DataFrame(list_unique)
In [182]:
temp_df.sort_index(ascending=False)
Out[182]:
emp_prse_count
129 19.0
128 22.4
127 12.5
126 38.4
125 13.0
... ...
4 0.4
3 1.2
2 0.3
1 0.2
0 0.0

130 rows × 1 columns

In [183]:
temp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 1 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   emp_prse_count  130 non-null    float64
dtypes: float64(1)
memory usage: 1.1 KB
In [184]:
with pd.option_context('max_colwidth', None):
    print(nat_desc.loc[nat_desc['Field'] == 'emp_prse', 'Field Description'])
12    Percent relative standard error (PRSE) for the employment estimate. PRSE is a measure of sampling error, expressed as a percentage of the corresponding estimate. Sampling error occurs when values for a population are estimated from a sample survey of the population, rather than calculated from data for all members of the population. Estimates with lower PRSEs are typically more precise in the presence of sampling error.
Name: Field Description, dtype: object
Observations on emp_prse¶

This seems straight forward and needs no further intervention.

Update To Do Items¶

  • Analyze and inspect emp_prse

mean_prse Wrangle¶

We're going to go out of order and do mean_prse next, because the annual and hourly-related columns can all be grouped together for simultaneous analysis.

The analysis here is similar to emp_prse.

Update To Do Items¶

  • Analyze and inspect mean_prse
In [185]:
list_unique = nat_data['emp_prse'].unique()
In [186]:
temp_df = pd.DataFrame()
temp_df['mean_prse_count'] = pd.DataFrame(list_unique)
In [187]:
temp_df.sort_index(ascending=False)
Out[187]:
mean_prse_count
129 19.0
128 22.4
127 12.5
126 38.4
125 13.0
... ...
4 0.4
3 1.2
2 0.3
1 0.2
0 0.0

130 rows × 1 columns

In [188]:
temp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 1 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   mean_prse_count  130 non-null    float64
dtypes: float64(1)
memory usage: 1.1 KB
In [189]:
with pd.option_context('max_colwidth', None):
    print(nat_desc.loc[nat_desc['Field'] == 'mean_prse', 'Field Description'])
19    Percent relative standard error (PRSE) for the mean wage estimate. PRSE is a measure of sampling error, expressed as a percentage of the corresponding estimate. Sampling error occurs when values for a population are estimated from a sample survey of the population, rather than calculated from data for all members of the population. Estimates with lower PRSEs are typically more precise in the presence of sampling error.
Name: Field Description, dtype: object
In [190]:
temp_df.sort_index(ascending=False)
Out[190]:
mean_prse_count
129 19.0
128 22.4
127 12.5
126 38.4
125 13.0
... ...
4 0.4
3 1.2
2 0.3
1 0.2
0 0.0

130 rows × 1 columns

The mean_prse column is straightforward as well. There are no 0 or Null values.

Update To Do Items¶

  • Analyze and inspect mean_prse

Hourly and Annual Data Analysis¶

Initial Analysis and Grouping of Columns¶

We will analyze the hourly and annual-related columns as one group, first.

In [191]:
col_hr_an = [
    'h_mean',
    'a_mean',
    'mean_prse',
    'h_pct10',
    'h_pct25',
    'h_median',
    'h_pct75',
    'h_pct90',
    'a_pct10',
    'a_pct25',
    'a_median',
    'a_pct75',
    'a_pct90',
    'annual',
    'hourly'
]
In [192]:
nat_data[col_hr_an].info()
<class 'pandas.core.frame.DataFrame'>
Index: 1070 entries, 0 to 1399
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   h_mean     1070 non-null   object 
 1   a_mean     1070 non-null   object 
 2   mean_prse  1070 non-null   float64
 3   h_pct10    1070 non-null   object 
 4   h_pct25    1070 non-null   object 
 5   h_median   1070 non-null   object 
 6   h_pct75    1070 non-null   object 
 7   h_pct90    1070 non-null   object 
 8   a_pct10    1070 non-null   object 
 9   a_pct25    1070 non-null   object 
 10  a_median   1070 non-null   object 
 11  a_pct75    1070 non-null   object 
 12  a_pct90    1070 non-null   object 
 13  annual     78 non-null     object 
 14  hourly     7 non-null      object 
dtypes: float64(1), object(14)
memory usage: 133.8+ KB
In [193]:
nat_data[col_hr_an].head(10)
Out[193]:
h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
0 32.66 67920 0.1 14.42 17.66 23.8 37.89 60.44 29990 36730 49500 78810 125720 NaN NaN
1 68.15 141760 0.2 27.41 38.42 58.7 82.5 # 57010 79900 122090 171610 # NaN NaN
2 67.24 139860 0.4 22.84 33.08 50.48 81.01 # 47510 68800 104990 168490 # NaN NaN
3 126.41 262930 0.9 35.44 60.61 99.24 # # 73710 126080 206420 # # NaN NaN
5 64 133120 0.4 22.8 32.29 49.5 78.91 # 47420 67160 102950 164130 # NaN NaN
7 * 67390 1.6 * * * * * 20380 29120 44810 80350 137820 True NaN
9 78.72 163740 0.3 35.25 48.43 69.48 98.52 # 73320 100740 144530 204920 # NaN NaN
10 71.76 149270 2.3 30.29 41.34 61.04 85.85 # 63000 85990 126960 178570 # NaN NaN
12 79.35 165050 0.3 35.13 48.65 71.57 99.24 # 73080 101200 148870 206410 # NaN NaN
13 82.46 171520 0.6 39.38 53.47 77.42 101.48 # 81900 111210 161030 211080 # NaN NaN
In [194]:
nat_data['h_pct90'][0]
Out[194]:
60.44
In [195]:
print(type(nat_data['h_pct90'][0]))
<class 'float'>
In [196]:
nat_data['h_pct90'][1]
Out[196]:
'#'
In [197]:
print(type(nat_data['h_pct90'][1]))
<class 'str'>
Commentary¶

We see that these columns have mixed types. For each of the columns in each dataset, we need to resolve this issue.

The majority of the annual and hourly columns are Null. We will need to inspect and decide what this data means.

Let's focus on the h_pct90 column, picked at random.

The first task is to find out how many types there are.

The second task is to find out how many unique types of strings there are.

Update To Do Items¶

  • Analyze types and particulars of strings
  • Replace or drop rows across all columns as necessary
  • Analyze and decide on course of action for hourly and annual columns
In [198]:
hr_an_desc = pd.DataFrame(nat_desc.loc[nat_desc['Field'].isin(col_hr_an),['Field', 'Field Description']])
In [199]:
with pd.option_context('max_colwidth', None):
    print(hr_an_desc)
        Field  \
17     h_mean   
18     a_mean   
19  mean_prse   
20    h_pct10   
21    h_pct25   
22   h_median   
23    h_pct75   
24    h_pct90   
25    a_pct10   
26    a_pct25   
27   a_median   
28    a_pct75   
29    a_pct90   
30     annual   
31     hourly   

                                                                                                                                                                                                                                                                                                                                                                                                                         Field Description  
17                                                                                                                                                                                                                                                                                                                                                                                                                        Mean hourly wage  
18                                                                                                                                                                                                                                                                                                                                                                                                                       Mean annual wage   
19  Percent relative standard error (PRSE) for the mean wage estimate. PRSE is a measure of sampling error, expressed as a percentage of the corresponding estimate. Sampling error occurs when values for a population are estimated from a sample survey of the population, rather than calculated from data for all members of the population. Estimates with lower PRSEs are typically more precise in the presence of sampling error.  
20                                                                                                                                                                                                                                                                                                                                                                                                             Hourly 10th percentile wage  
21                                                                                                                                                                                                                                                                                                                                                                                                             Hourly 25th percentile wage  
22                                                                                                                                                                                                                                                                                                                                                                                             Hourly median wage (or the 50th percentile)  
23                                                                                                                                                                                                                                                                                                                                                                                                             Hourly 75th percentile wage  
24                                                                                                                                                                                                                                                                                                                                                                                                             Hourly 90th percentile wage  
25                                                                                                                                                                                                                                                                                                                                                                                                             Annual 10th percentile wage  
26                                                                                                                                                                                                                                                                                                                                                                                                             Annual 25th percentile wage  
27                                                                                                                                                                                                                                                                                                                                                                                             Annual median wage (or the 50th percentile)  
28                                                                                                                                                                                                                                                                                                                                                                                                             Annual 75th percentile wage  
29                                                                                                                                                                                                                                                                                                                                                                                                             Annual 90th percentile wage  
30                                                                                                                                                                                       Contains "TRUE" if only annual wages are released. The OEWS program releases only annual wages for some occupations that typically work fewer than 2,080 hours per year, but are paid on an annual basis, such as teachers, pilots, and athletes.  
31                                                                                                                                                                            Contains "TRUE" if only hourly wages are released. The OEWS program releases only hourly wages for some occupations that typically work fewer than 2,080 hours per year and are paid on an hourly basis, such as actors, dancers, and musicians and singers.  

Everything here is straightforward.

Reviewing the Notes Dataframe¶

Now, we need to also look at the notes_df from earlier.

In [200]:
with pd.option_context('max_colwidth', None):
    print(notes_df)
                                                                                         Notes
34                                        *  = indicates that a wage estimate is not available
35                                **  = indicates that an employment estimate is not available
36       #  = indicates a wage equal to or greater than $115.00 per hour or $239,200 per year 
37  ~ =indicates that the percent of establishments reporting the occupation is less than 0.5%

Observing the Various Types Within One Row¶

Let's look at the various types that are found within one column, as a representation of the rest.

In [201]:
nat_data['h_pct90'].apply(type).unique()
Out[201]:
array([<class 'float'>, <class 'str'>, <class 'int'>], dtype=object)

We see float, str, int.

The float values are probably desirable data points and don't need to be changed.

The str values will be those values found in the notes.

The int values need further inspection. Likely, they are float's that were cast into int's during the import stage, or something similar.

Value Counting¶

In [202]:
nat_data['h_pct90'][nat_data['h_pct90'].apply(type) == str].unique()
Out[202]:
array(['#', '*'], dtype=object)
In [203]:
nat_data['h_pct90'][nat_data['h_pct90'].apply(type) == int].unique()
Out[203]:
array([110, 86, 61, 40, 60, 50, 20, 29, 54, 23, 35, 33], dtype=object)
In [204]:
nat_data['h_pct90'][nat_data['h_pct90'].apply(type) == int].value_counts()
Out[204]:
h_pct90
40     4
110    1
86     1
61     1
60     1
50     1
20     1
29     1
54     1
23     1
35     1
33     1
Name: count, dtype: int64

Analysis for int Values¶

The int values all seem straight forward. We will simply cast them back into float values.

Looking at the Highest Values¶
In [205]:
nat_num_only = pd.DataFrame(nat_data['h_pct90'][(nat_data['h_pct90'].apply(type) != str)])
In [206]:
nat_num_only['h_pct90'][nat_num_only['h_pct90'] > 99]
Out[206]:
17      104.16
28      105.35
36      105.76
46      102.12
57      105.33
117     102.79
137      111.6
145     100.96
147      99.92
149     101.66
156      99.24
187     107.61
207        110
232     113.34
264     102.27
269     107.97
339     101.39
342     104.11
449     101.64
474     100.11
476     105.44
509     102.34
537     102.35
810     103.47
1322    101.16
Name: h_pct90, dtype: object

Analysis for str Values¶

We have a conundrum.

Regarding the # value, this str is prevalant in many of the columns.

The representation is important and cannot be dropped. At the same time, it is non-specific.

What I will do for now is simply insert the represented lowest value of $115/hr.

The same will go for the annual values, with the value being $239,200/year.

We will need to adjust the method of analysis whenever these values arise. In many cases, these values may simply have to be dropped from consideration.

Since there are a significant number of these values, the rows may not be able to be dropped altogether.

We may need to create two groups: a normal group and a high-income group.

Analyzing * Symbols¶

The next question: what to do about the * symbols?

In [207]:
temp_df = nat_data[nat_data['h_pct90'] == '*']
In [208]:
temp_df[col_hr_an].head()
Out[208]:
h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
7 * 67390 1.6 * * * * * 20380 29120 44810 80350 137820 True NaN
45 * 113360 0.4 * * * * * 72400 83840 104070 132550 165820 True NaN
350 * 102410 0.5 * * * * * 47010 61300 81600 125900 180630 True NaN
351 * 113840 0.6 * * * * * 46460 63040 97270 140360 210530 True NaN
353 * 98400 0.6 * * * * * 48470 62160 82390 124840 168080 True NaN
In [209]:
type(nat_data.loc[nat_data['h_pct90'] == '*','a_pct10'].values[0])
Out[209]:
int
Commentary¶

An initial inspection shows that the hourly columns are at least someimtes blank when the annual columns have values.

That raises the question, how many instances exist where hourly is * and annual is something other than an int?

In [210]:
# Lambda logic to obtain a_pct90 instances of non-int values 
nat_data[nat_data['a_pct90'].apply(lambda x: isinstance(x, int)) == False].head(2)
Out[210]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
1 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-0000 Management Occupations ... 58.7 82.5 # 57010 79900 122090 171610 # NaN NaN
2 99 U.S. 1 US 0 Cross-industry cross-industry 1235 11-1000 Top Executives ... 50.48 81.01 # 47510 68800 104990 168490 # NaN NaN

2 rows × 28 columns

In [211]:
nat_data[nat_data['a_pct90'].apply(lambda x: isinstance(x, int)) == False].shape
Out[211]:
(67, 28)
In [212]:
# Logic to obtain h_pct90 is * and a_pct is not #
(nat_data['h_pct90'] == '*') & (nat_data['a_pct90'] != '#')
Out[212]:
0       False
1       False
2       False
3       False
5       False
        ...  
1394    False
1395    False
1396    False
1397    False
1399    False
Length: 1070, dtype: bool
In [213]:
# Test that instances are found
((nat_data['h_pct90'] == '*') & (nat_data['a_pct90'] != '#')).sum()
Out[213]:
np.int64(71)
In [214]:
# Combine the two together
nat_data[(nat_data['h_pct90'] == '*') & (nat_data['a_pct90'] != '#') & (nat_data['a_pct90'].apply(lambda x: isinstance(x, int)) == False)]
Out[214]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly

0 rows × 28 columns

Commentary¶

A preliminary, quick study shows that if the value in an hourly column is *, it is possible that no similar value matches in the associated annual column.

This search is needs to be expanded across all hourly and annual columns.

The h_mean and a_mean columns are likely composites of the two different types of numbers.

Logic stands to reason that we can simplify our search to see wehther these are ever both equal to * at the same time. If not, then we can assume, for now, that the two types of columns do not conflict.

Comparing h_mean and a_mean¶

In [215]:
nat_data[(nat_data['h_mean'] == '*') & (nat_data['a_mean'] == '*')]
Out[215]:
area area_title area_type prim_state naics naics_title i_group own_code occ_code occ_title ... h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly

0 rows × 28 columns

Commentary¶

There appears to be no overlap between the two columns.

Analyzing the hourly and annual Columns¶

The next question is whether there are rows where both hourly and annual values are provided.

In [216]:
nat_data[(nat_data['h_mean'].apply(lambda x: isinstance(x, int)) == True) & (nat_data['a_mean'].apply(lambda x: isinstance(x, int)) == True)].shape
Out[216]:
(10, 28)
In [217]:
temp_df = nat_data[(nat_data['h_mean'].apply(lambda x: isinstance(x, int)) == True) & (nat_data['a_mean'].apply(lambda x: isinstance(x, int)) == True)]
In [218]:
temp_col = ['occ_title'] + col_hr_an
temp_df[temp_col]
Out[218]:
occ_title h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
5 General and Operations Managers 64 133120 0.4 22.8 32.29 49.5 78.91 # 47420 67160 102950 164130 # NaN NaN
189 Electrical and Electronics Engineers 60 124810 0.6 36.63 44.42 57.11 72.37 87.21 76190 92390 118780 150530 181390 NaN NaN
346 Miscellaneous Legal Support Workers 35 72800 1.0 18.35 22.57 29.11 39.13 62.75 38170 46950 60550 81390 130510 NaN NaN
791 Sales and Related Occupations 26 54070 0.3 13.01 14.91 18.01 28.79 47.53 27060 31010 37460 59880 98860 NaN NaN
920 Legal Secretaries and Administrative Assistants 29 60320 2.6 17.08 20.54 26.03 34.66 42.15 35530 42720 54140 72090 87660 NaN NaN
924 Data Entry and Information Processing Workers 21 43680 0.5 15 17.19 19.94 23.7 28.47 31200 35760 41480 49300 59220 NaN NaN
950 Animal Breeders 27 56150 5.4 17.85 20.82 25 28.67 43.26 37130 43310 52000 59630 89970 NaN NaN
1121 Millwrights 33 68640 1.0 21.68 26.65 31.33 38.74 45.07 45100 55420 65170 80580 93740 NaN NaN
1266 Plant and System Operators, All Other 31 64490 1.6 19.35 23.01 29.67 36.98 44.59 40250 47860 61710 76910 92750 NaN NaN
1385 Industrial Truck and Tractor Operators 23 47830 0.2 17.55 19.13 22.3 25.81 29.59 36500 39780 46390 53680 61540 NaN NaN

The answer appears to be, yes. There are 10 rows where both annual and hourly data is provided.

More information will be needed to understand why these industries' parameters allow for both types of pay.

Find Instances of ~ and **¶

In [219]:
nat_data_mask = nat_data[nat_data.isin(['~']).any(axis=1)]
In [220]:
nat_data_mask.shape
Out[220]:
(0, 28)
In [221]:
nat_data_max = nat_data[nat_data.isin(['**']).any(axis=1)]
In [222]:
nat_data_mask.shape
Out[222]:
(0, 28)

There appears to be no instances of these two symbols in the National Dataset

Update To Do Items¶

  • Analyze types and particulars of strings
  • Replace or drop rows across all columns as necessary:
    • In all hourly data rows, insert 115 for all #
    • In all annual data rows, insert 239,200 for all #
    • In all annual and hourly rows, insert null for all *
  • Analyze and decide on course of action for hourly and annual columns
  • Research issue where 10 rows have both annual and hourly data

h_mean¶

The h_mean column contains * strings that need to be converted to NaN, and int values that need to be converted to float.

Converting str to NaN¶

In [223]:
# List of columns to drop from the filtered dataset for simplicity
an_hr_col_drop = [
    'area',
    'area_title',
    'area_type',
    'prim_state',
    'naics',
    'naics_title',
    'i_group',
    'own_code',
    'occ_code',
    'o_group',
    'tot_emp',
    'emp_prse'
]
In [224]:
# Filter the primary dataset
nat_data_filtered = nat_data.drop(columns=an_hr_col_drop, axis=1)
In [225]:
# Capture a map of the types in the h_mean column
h_mean_df = pd.DataFrame(nat_data_filtered['h_mean'].map(type))
In [226]:
# Capture the index of all rows of h_mean that are of type str
h_mean_str_idx = h_mean_df[h_mean_df['h_mean'] == str].index
In [227]:
# Capture all str values in the h_mean col
nat_data_h_mean_str = pd.DataFrame(nat_data_filtered['h_mean'].loc[h_mean_str_idx])
In [228]:
nat_data_h_mean_str['h_mean'].value_counts()
Out[228]:
h_mean
*    78
Name: count, dtype: int64

There are 78 instances of * to be adjusted. There are no # instances.

In [229]:
# Capture the index of all `*` cells
h_mean_star_idx = nat_data_h_mean_str[nat_data_h_mean_str['h_mean'] == '*'].index
In [230]:
print(len(h_mean_star_idx))
78
In [231]:
# Set each of these cells to `NaN`
nat_data_filtered.loc[h_mean_star_idx, 'h_mean'] = np.nan
In [232]:
# Check that no type str values remain in the column
nat_data_filtered[nat_data_filtered['h_mean'].apply(type) == str].shape
Out[232]:
(0, 16)
In [233]:
# Check that each of the targeted cells is NaN 
nat_data_filtered['h_mean'].loc[h_mean_star_idx]
Out[233]:
7       NaN
45      NaN
350     NaN
351     NaN
353     NaN
       ... 
1317    NaN
1318    NaN
1319    NaN
1320    NaN
1324    NaN
Name: h_mean, Length: 78, dtype: object

Set all int to float¶

Now that all of the str values are set to NaN, all int values can be easily converted to float.

In [234]:
nat_data_filtered['h_mean'] = nat_data_filtered['h_mean'].astype(float)
In [235]:
nat_data_filtered['h_mean'].describe()
Out[235]:
count    992.000000
mean      36.408649
std       23.397692
min       14.820000
25%       22.847500
50%       30.115000
75%       41.217500
max      216.740000
Name: h_mean, dtype: float64
In [236]:
nat_data_filtered['h_mean'].isna().sum()
Out[236]:
np.int64(78)

Update nat_data¶

The primary nat_data dataset h_mean column is prepared to receive the filtered set's wrangled column.

In [237]:
nat_data['h_mean'] = nat_data_filtered['h_mean']
In [238]:
nat_data[nat_data['h_mean'].apply(type) == str].shape
Out[238]:
(0, 28)
In [239]:
nat_data[nat_data['h_mean'].apply(type) == int].shape
Out[239]:
(0, 28)
In [240]:
nat_data['h_mean'].isna().sum()
Out[240]:
np.int64(78)
In [241]:
nat_data['h_mean'].describe()
Out[241]:
count    992.000000
mean      36.408649
std       23.397692
min       14.820000
25%       22.847500
50%       30.115000
75%       41.217500
max      216.740000
Name: h_mean, dtype: float64

a_mean¶

The a_mean column contains * strings that need to be converted to NaN, and int values that need to be converted to float.

Converting * to str¶

In [242]:
# Filter the primary dataset
nat_data_filtered = nat_data.drop(columns=an_hr_col_drop, axis=1)
In [243]:
# Capture a map of the types in the a_mean column
a_mean_df = pd.DataFrame(nat_data_filtered['a_mean'].map(type))
In [244]:
# Capture the index of all rows of a_mean that are of type str
a_mean_str_idx = a_mean_df[a_mean_df['a_mean'] == str].index
In [245]:
# Capture all str values in the a_mean col
nat_data_a_mean_str = pd.DataFrame(nat_data_filtered['a_mean'].loc[a_mean_str_idx])
In [246]:
nat_data_a_mean_str['a_mean'].value_counts()
Out[246]:
a_mean
*    7
Name: count, dtype: int64

There are 7 instances of * and no instances of #.

In [247]:
# Capture the index of all `*` cells
a_mean_star_idx = nat_data_a_mean_str[nat_data_a_mean_str['a_mean'] == '*'].index
In [248]:
print(len(a_mean_star_idx))
7
In [249]:
# Set each of these cells to `NaN`
nat_data_filtered.loc[a_mean_star_idx, 'a_mean'] = np.nan
In [250]:
# Check that no type str values remain in the column
nat_data_filtered[nat_data_filtered['a_mean'].apply(type) == str].shape
Out[250]:
(0, 16)
In [251]:
# Check that each of the targeted cells is NaN 
nat_data_filtered['a_mean'].loc[a_mean_star_idx]
Out[251]:
465    NaN
472    NaN
474    NaN
476    NaN
477    NaN
478    NaN
479    NaN
Name: a_mean, dtype: object

Set all int to float¶

Now that all of the str values are set to NaN, all int values can be easily converted to float.

In [252]:
nat_data_filtered['a_mean'] = nat_data_filtered['a_mean'].astype(float)
In [253]:
nat_data_filtered['a_mean'].describe()
Out[253]:
count      1063.000000
mean      77213.170273
std       48508.978729
min       30830.000000
25%       48560.000000
50%       64220.000000
75%       89740.000000
max      450810.000000
Name: a_mean, dtype: float64
In [254]:
nat_data_filtered['a_mean'].isna().sum()
Out[254]:
np.int64(7)

Update nat_data¶

The primary nat_data dataset h_mean column is prepared to receive the filtered set's wrangled column.

In [255]:
nat_data['a_mean'] = nat_data_filtered['a_mean']
In [256]:
nat_data[nat_data['a_mean'].apply(type) == str].shape
Out[256]:
(0, 28)
In [257]:
nat_data[nat_data['a_mean'].apply(type) == int].shape
Out[257]:
(0, 28)
In [258]:
nat_data['a_mean'].isna().sum()
Out[258]:
np.int64(7)
In [259]:
nat_data['a_mean'].describe()
Out[259]:
count      1063.000000
mean      77213.170273
std       48508.978729
min       30830.000000
25%       48560.000000
50%       64220.000000
75%       89740.000000
max      450810.000000
Name: a_mean, dtype: float64

mean_prse¶

Initial Inspection¶

Let's first check the description.

In [260]:
with pd.option_context('display.max_colwidth', None):
    print(nat_desc[nat_desc['Field'] == 'mean_prse']['Field Description'])
19    Percent relative standard error (PRSE) for the mean wage estimate. PRSE is a measure of sampling error, expressed as a percentage of the corresponding estimate. Sampling error occurs when values for a population are estimated from a sample survey of the population, rather than calculated from data for all members of the population. Estimates with lower PRSEs are typically more precise in the presence of sampling error.
Name: Field Description, dtype: object
In [261]:
nat_data_filtered['mean_prse'].head()
Out[261]:
0    0.1
1    0.2
2    0.4
3    0.9
5    0.4
Name: mean_prse, dtype: float64

The column is of type float64, and therefore there are no string values.

In [262]:
nat_data_filtered['mean_prse'].describe()
Out[262]:
count    1070.000000
mean        1.224206
std         1.773071
min         0.000000
25%         0.400000
50%         0.700000
75%         1.300000
max        22.100000
Name: mean_prse, dtype: float64
In [263]:
nat_data_filtered['mean_prse'].isna().sum()
Out[263]:
np.int64(0)
In [264]:
len(nat_data_filtered[nat_data_filtered['mean_prse'] == 0])
Out[264]:
5
In [265]:
nat_data_filtered[nat_data_filtered['mean_prse'] == 0]
Out[265]:
occ_title h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
61 Postmasters and Mail Superintendents 45.08 93760.0 0.0 39.15 41.9 44.58 47.88 52.47 81430 87150 92730 99590 109140 NaN NaN
907 Postal Service Workers 28.71 59720.0 0.0 20.38 22.13 27.82 35.6 36.2 42390 46030 57870 74050 75300 NaN NaN
908 Postal Service Clerks 29.36 61070.0 0.0 20.48 26.64 29.63 35.6 35.6 42600 55410 61630 74050 74050 NaN NaN
909 Postal Service Mail Carriers 28.79 59880.0 0.0 20.38 22.13 27.64 36.2 36.96 42390 46030 57490 75300 76880 NaN NaN
910 Postal Service Mail Sorters, Processors, and P... 28.04 58310.0 0.0 20.48 22.78 27.18 35.08 35.6 42600 47380 56530 72970 74050 NaN NaN

Analysis¶

Apparently, there is no mean_prse sampling error for the Postal service. There may be a legitimate reason for this data.

There is no need to alter this column further.

h_pct10¶

Initial Inspection¶

In [266]:
with pd.option_context('display.max_colwidth', None):
    print(nat_desc[nat_desc['Field'] == 'h_pct10']['Field Description'])
20    Hourly 10th percentile wage
Name: Field Description, dtype: object
In [267]:
nat_data_filtered[nat_data_filtered['h_pct10'].apply(type) == str].shape
Out[267]:
(78, 16)
In [268]:
nat_data_filtered[nat_data_filtered['h_pct10'].apply(type) == str].head(2)
Out[268]:
occ_title h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
7 Legislators NaN 67390.0 1.6 * * * * * 20380 29120 44810 80350 137820 True NaN
45 Education Administrators, Kindergarten through... NaN 113360.0 0.4 * * * * * 72400 83840 104070 132550 165820 True NaN
In [269]:
nat_data_filtered[nat_data_filtered['h_pct10'] == '*'].shape
Out[269]:
(78, 16)
Analysis¶

There are 78 instances of str values and all are the * value.

All instances should be replaced with NaN.

Converting * to str¶

In [270]:
# Filter the primary dataset
nat_data_filtered = nat_data.drop(columns=an_hr_col_drop, axis=1)
In [271]:
# Capture a map of the types in the h_pct10 column
h_pct10_df = pd.DataFrame(nat_data_filtered['h_pct10'].map(type))
In [272]:
# Capture the index of all rows of h_pct10 that are of type str
h_pct10_str_idx = h_pct10_df[h_pct10_df['h_pct10'] == str].index
In [273]:
# Capture all str values in the h_pct10 col
nat_data_h_pct10_str = pd.DataFrame(nat_data_filtered['h_pct10'].loc[h_pct10_str_idx])
In [274]:
nat_data_h_pct10_str['h_pct10'].value_counts()
Out[274]:
h_pct10
*    78
Name: count, dtype: int64

There are 7 instances of * and no instances of #.

In [275]:
# Capture the index of all `*` cells
h_pct10_star_idx = nat_data_h_pct10_str[nat_data_h_pct10_str['h_pct10'] == '*'].index
In [276]:
print(len(h_pct10_star_idx))
78
In [277]:
# Set each of these cells to `NaN`
nat_data_filtered.loc[h_pct10_star_idx, 'h_pct10'] = np.nan
In [278]:
# Check that no type str values remain in the column
nat_data_filtered[nat_data_filtered['h_pct10'].apply(type) == str].shape
Out[278]:
(0, 16)
In [279]:
# Check that each of the targeted cells is NaN 
nat_data_filtered['h_pct10'].loc[h_pct10_star_idx]
Out[279]:
7       NaN
45      NaN
350     NaN
351     NaN
353     NaN
       ... 
1317    NaN
1318    NaN
1319    NaN
1320    NaN
1324    NaN
Name: h_pct10, Length: 78, dtype: object

Set all int to float¶

Now that all of the str values are set to NaN, all int values can be easily converted to float.

In [280]:
nat_data_filtered['h_pct10'] = nat_data_filtered['h_pct10'].astype(float)
In [281]:
nat_data_filtered['h_pct10'].describe()
Out[281]:
count    992.000000
mean      20.477671
std        8.139964
min        8.890000
25%       15.352500
50%       17.955000
75%       22.955000
max       91.210000
Name: h_pct10, dtype: float64
In [282]:
nat_data_filtered['h_pct10'].isna().sum()
Out[282]:
np.int64(78)

Update nat_data¶

The primary nat_data dataset h_pct10 column is prepared to receive the filtered set's wrangled column.

In [283]:
nat_data['h_pct10'] = nat_data_filtered['h_pct10']
In [284]:
nat_data[nat_data['h_pct10'].apply(type) == str].shape
Out[284]:
(0, 28)
In [285]:
nat_data[nat_data['h_pct10'].apply(type) == int].shape
Out[285]:
(0, 28)
In [286]:
nat_data['h_pct10'].isna().sum()
Out[286]:
np.int64(78)
In [287]:
nat_data['h_pct10'].describe()
Out[287]:
count    992.000000
mean      20.477671
std        8.139964
min        8.890000
25%       15.352500
50%       17.955000
75%       22.955000
max       91.210000
Name: h_pct10, dtype: float64

h_pct25¶

Initial Inspection¶

In [288]:
with pd.option_context('display.max_colwidth', None):
    print(nat_desc[nat_desc['Field'] == 'h_pct25']['Field Description'])
21    Hourly 25th percentile wage
Name: Field Description, dtype: object
In [289]:
nat_data_filtered[nat_data_filtered['h_pct25'].apply(type) == str].shape
Out[289]:
(79, 16)
In [290]:
nat_data_filtered[nat_data_filtered['h_pct25'].apply(type) == str].head(2)
Out[290]:
occ_title h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
7 Legislators NaN 67390.0 1.6 NaN * * * * 20380 29120 44810 80350 137820 True NaN
45 Education Administrators, Kindergarten through... NaN 113360.0 0.4 NaN * * * * 72400 83840 104070 132550 165820 True NaN
In [291]:
nat_data_filtered[nat_data_filtered['h_pct25'] == '*'].shape
Out[291]:
(78, 16)
In [292]:
nat_data_filtered[nat_data_filtered['h_pct25'] == '#'].shape
Out[292]:
(1, 16)
In [293]:
with pd.option_context('display.max_colwidth', None):
    print(notes_df)
                                                                                         Notes
34                                        *  = indicates that a wage estimate is not available
35                                **  = indicates that an employment estimate is not available
36       #  = indicates a wage equal to or greater than $115.00 per hour or $239,200 per year 
37  ~ =indicates that the percent of establishments reporting the occupation is less than 0.5%
Analysis¶

There are 78 instances of the * value.

All instances should be replaced with NaN.

There is 1 instance of the # value.

It should be replaced with 115.

Converting * to str¶

In [294]:
# Filter the primary dataset
nat_data_filtered = nat_data.drop(columns=an_hr_col_drop, axis=1)
In [295]:
# Capture a map of the types in the h_pct25 column
h_pct25_df = pd.DataFrame(nat_data_filtered['h_pct25'].map(type))
In [296]:
# Capture the index of all rows of h_pct25 that are of type str
h_pct25_str_idx = h_pct25_df[h_pct25_df['h_pct25'] == str].index
In [297]:
# Capture all str values in the h_pct25 col
nat_data_h_pct25_str = pd.DataFrame(nat_data_filtered['h_pct25'].loc[h_pct25_str_idx])
In [298]:
nat_data_h_pct25_str['h_pct25'].value_counts()
Out[298]:
h_pct25
*    78
#     1
Name: count, dtype: int64

There are 78 instances of * and one instance of #.

Setting * to NaN¶

In [299]:
# Capture the index of all `*` cells
h_pct25_star_idx = nat_data_h_pct25_str[nat_data_h_pct25_str['h_pct25'] == '*'].index
In [300]:
print(len(h_pct25_star_idx))
78
In [301]:
# Set each of these cells to `NaN`
nat_data_filtered.loc[h_pct25_star_idx, 'h_pct25'] = np.nan
In [302]:
# Check that `78` type str values are removed from the column
nat_data_filtered[nat_data_filtered['h_pct25'].apply(type) == str].shape
Out[302]:
(1, 16)

Setting # to 115

In [303]:
# Capture the index of all `#` cells
h_pct25_hash_idx = nat_data_h_pct25_str[nat_data_h_pct25_str['h_pct25'] == '#'].index
In [304]:
print(len(h_pct25_hash_idx))
1
In [305]:
# Set each of these cells to `115`
nat_data_filtered.loc[h_pct25_hash_idx, 'h_pct25'] = 115
Verify¶
In [306]:
# Check that each of the targeted cells is NaN 
nat_data_filtered['h_pct25'].loc[h_pct25_star_idx].unique()
Out[306]:
array([nan], dtype=object)
In [307]:
# Check that each of the targeted cells is 115 
nat_data_filtered['h_pct25'].loc[h_pct25_hash_idx].unique()
Out[307]:
array([115], dtype=object)

Set all int to float¶

Now that all of the str values are set to NaN or 115, all int values can be easily converted to float.

In [308]:
nat_data_filtered['h_pct25'] = nat_data_filtered['h_pct25'].astype(float)
In [309]:
nat_data_filtered['h_pct25'].describe()
Out[309]:
count    992.000000
mean      25.801018
std       13.398383
min       11.770000
25%       17.932500
50%       22.125000
75%       28.825000
max      115.000000
Name: h_pct25, dtype: float64
In [310]:
nat_data_filtered['h_pct25'].isna().sum()
Out[310]:
np.int64(78)

Update nat_data¶

The primary nat_data dataset h_pct25 column is prepared to receive the filtered set's wrangled column.

In [311]:
nat_data['h_pct25'] = nat_data_filtered['h_pct25']
In [312]:
nat_data[nat_data['h_pct25'].apply(type) == str].shape
Out[312]:
(0, 28)
In [313]:
nat_data[nat_data['h_pct25'].apply(type) == int].shape
Out[313]:
(0, 28)
In [314]:
nat_data['h_pct25'].isna().sum()
Out[314]:
np.int64(78)
In [315]:
nat_data['h_pct25'].describe()
Out[315]:
count    992.000000
mean      25.801018
std       13.398383
min       11.770000
25%       17.932500
50%       22.125000
75%       28.825000
max      115.000000
Name: h_pct25, dtype: float64

h_median¶

Initial Inspection¶

In [316]:
with pd.option_context('display.max_colwidth', None):
    print(nat_desc[nat_desc['Field'] == 'h_median']['Field Description'])
22    Hourly median wage (or the 50th percentile)
Name: Field Description, dtype: object
In [317]:
nat_data_filtered[nat_data_filtered['h_median'].apply(type) == str].shape
Out[317]:
(97, 16)
In [318]:
nat_data_filtered[nat_data_filtered['h_median'].apply(type) == str].head(2)
Out[318]:
occ_title h_mean a_mean mean_prse h_pct10 h_pct25 h_median h_pct75 h_pct90 a_pct10 a_pct25 a_median a_pct75 a_pct90 annual hourly
7 Legislators NaN 67390.0 1.6 NaN NaN * * * 20380 29120 44810 80350 137820 True NaN
45 Education Administrators, Kindergarten through... NaN 113360.0 0.4 NaN NaN * * * 72400 83840 104070 132550 165820 True NaN
In [319]:
nat_data_filtered[nat_data_filtered['h_median'] == '*'].shape
Out[319]:
(78, 16)
In [320]:
nat_data_filtered[nat_data_filtered['h_median'] == '#'].shape
Out[320]:
(19, 16)
In [321]:
with pd.option_context('display.max_colwidth', None):
    print(notes_df)
                                                                                         Notes
34                                        *  = indicates that a wage estimate is not available
35                                **  = indicates that an employment estimate is not available
36       #  = indicates a wage equal to or greater than $115.00 per hour or $239,200 per year 
37  ~ =indicates that the percent of establishments reporting the occupation is less than 0.5%
Analysis¶

There are 78 instances of the * value.

All instances should be replaced with NaN.

There are 19 instances of the # value.

These should be replaced with 115.

Converting * to str¶

In [322]:
# Filter the primary dataset
nat_data_filtered = nat_data.drop(columns=an_hr_col_drop, axis=1)
In [323]:
# Capture a map of the types in the h_median column
h_median_df = pd.DataFrame(nat_data_filtered['h_median'].map(type))
In [324]:
# Capture the index of all rows of h_median that are of type str
h_median_str_idx = h_median_df[h_median_df['h_median'] == str].index
In [325]:
# Capture all str values in the h_median col
nat_data_h_median_str = pd.DataFrame(nat_data_filtered['h_median'].loc[h_median_str_idx])
In [326]:
nat_data_h_median_str['h_median'].value_counts()
Out[326]:
h_median
*    78
#    19
Name: count, dtype: int64

There are 78 instances of * and 19 instances of #.

Setting * to NaN¶

In [327]:
# Capture the index of all `*` cells
h_median_star_idx = nat_data_h_median_str[nat_data_h_median_str['h_median'] == '*'].index
In [328]:
print(len(h_median_star_idx))
78
In [329]:
# Set each of these cells to `NaN`
nat_data_filtered.loc[h_median_star_idx, 'h_median'] = np.nan
In [330]:
# Check that `78` type str values are removed from the column
nat_data_filtered[nat_data_filtered['h_median'].apply(type) == str].shape
Out[330]:
(19, 16)

Setting # to 115

In [331]:
# Capture the index of all `#` cells
h_median_hash_idx = nat_data_h_median_str[nat_data_h_median_str['h_median'] == '#'].index
In [332]:
print(len(h_median_hash_idx))
19
In [333]:
# Set each of these cells to `115`
nat_data_filtered.loc[h_median_hash_idx, 'h_median'] = 115
Verify¶
In [334]:
# Check that each of the targeted cells is NaN 
nat_data_filtered['h_median'].loc[h_median_star_idx].unique()
Out[334]:
array([nan], dtype=object)
In [335]:
# Check that each of the targeted cells is 115 
nat_data_filtered['h_median'].loc[h_median_hash_idx].unique()
Out[335]:
array([115], dtype=object)

Set all int to float¶

Now that all of the str values are set to NaN or 115, all int values can be easily converted to float.

In [336]:
nat_data_filtered['h_median'] = nat_data_filtered['h_median'].astype(float)
In [337]:
nat_data_filtered['h_median'].describe()
Out[337]:
count    992.000000
mean      32.838407
std       18.393770
min       14.500000
25%       21.725000
50%       27.780000
75%       37.580000
max      115.000000
Name: h_median, dtype: float64
In [338]:
nat_data_filtered['h_median'].isna().sum()
Out[338]:
np.int64(78)

Update nat_data¶

The primary nat_data dataset h_median column is prepared to receive the filtered set's wrangled column.

In [339]:
nat_data['h_median'] = nat_data_filtered['h_median']
In [340]:
nat_data[nat_data['h_median'].apply(type) == str].shape
Out[340]:
(0, 28)
In [341]:
nat_data[nat_data['h_median'].apply(type) == int].shape
Out[341]:
(0, 28)
In [342]:
nat_data['h_median'].isna().sum()
Out[342]:
np.int64(78)
In [343]:
nat_data['h_median'].describe()
Out[343]:
count    992.000000
mean      32.838407
std       18.393770
min       14.500000
25%       21.725000
50%       27.780000
75%       37.580000
max      115.000000
Name: h_median, dtype: float64
In [344]:
nat_data[nat_data['h_median'] == 115].shape
Out[344]:
(19, 28)

Creating Custom Formulas¶

The last several columns have all proceeded in an identical manner, and there are several columns to go.

To speed up the process, let's create a customized function that can be used on each of the columns to set the *, #, and int values all to their appropriate counterparts, and cast the resulting column to float.

Create Formula¶

In [345]:
# Take a column and 
def wrangle_a_h_col(
    nat_data,
    col_drop,
    col_wrangle,
    hash_replacement
):
    # Set the filtered dataset to be a subset of the primary dataset, dropping unnecessary columns 
    temp_df = nat_data.drop(columns=col_drop, axis=1)

    # Capture a map of the types in the target column
    col_df = pd.DataFrame(temp_df[col_wrangle].map(type))

    # Capture the index of all rows of col_wrangle that are of type str
    col_str_idx = col_df[col_df[col_wrangle] == str].index

    # Capture all str values in col_wrangle
    nat_data_col_str = pd.DataFrame(temp_df[col_wrangle].loc[col_str_idx])

    # Debug:
    print('str val counts:\n')
    print(nat_data_col_str[col_wrangle].value_counts())

    # Capture the index of all `*` cells
    col_star_idx = nat_data_col_str[nat_data_col_str[col_wrangle] == '*'].index

    # Debug:
    print(len(h_median_star_idx))

    # Set each of these cells to `NaN`
    temp_df.loc[col_star_idx, col_wrangle] = np.nan

    # Debug:
    print('Resulting shape:\n')
    print(temp_df[temp_df[col_wrangle].apply(type) == str].shape)

    # Capture the index of all `#` cells
    col_hash_idx = nat_data_col_str[nat_data_col_str[col_wrangle] == '#'].index

    # Debug:
    print(len(col_hash_idx))

    # Set each of these cells to hash_replacement
    temp_df.loc[col_hash_idx, col_wrangle] = hash_replacement

    # Debug:
    print('Check: each of the targeted cells is NaN:\n') 
    print(temp_df[col_wrangle].loc[col_star_idx].unique())

    # Set to type float
    temp_df[col_wrangle] = temp_df[col_wrangle].astype(float)

    # Debug:
    print('Describe:\n')
    print(temp_df[col_wrangle].describe())

    # Debug:
    print('Check NaN:\n')
    print(temp_df[col_wrangle].isna().sum())

    return temp_df

With this function in place, continuing through the remaining columns should proceed quickly.

h_pct75¶

In [346]:
nat_data_filtered = wrangle_a_h_col(nat_data, an_hr_col_drop, 'h_pct75', 115)
str val counts:

h_pct75
*    78
#    26
Name: count, dtype: int64
78
Resulting shape:

(26, 16)
26
Check: each of the targeted cells is NaN:

[nan]
Describe:

count    992.000000
mean      41.339748
std       21.625436
min       17.000000
25%       26.125000
50%       35.965000
75%       48.940000
max      115.000000
Name: h_pct75, dtype: float64
Check NaN:

78
In [347]:
nat_data_filtered['h_pct75'].describe()
Out[347]:
count    992.000000
mean      41.339748
std       21.625436
min       17.000000
25%       26.125000
50%       35.965000
75%       48.940000
max      115.000000
Name: h_pct75, dtype: float64
In [348]:
nat_data['h_pct75'] = nat_data_filtered['h_pct75']

h_pct90¶

In [349]:
nat_data_filtered = wrangle_a_h_col(nat_data, an_hr_col_drop, 'h_pct90', 115)
str val counts:

h_pct90
*    78
#    53
Name: count, dtype: int64
78
Resulting shape:

(53, 16)
53
Check: each of the targeted cells is NaN:

[nan]
Describe:

count    992.000000
mean      51.208468
std       25.278038
min       17.300000
25%       31.507500
50%       44.585000
75%       63.637500
max      115.000000
Name: h_pct90, dtype: float64
Check NaN:

78
In [350]:
nat_data_filtered['h_pct90'].describe()
Out[350]:
count    992.000000
mean      51.208468
std       25.278038
min       17.300000
25%       31.507500
50%       44.585000
75%       63.637500
max      115.000000
Name: h_pct90, dtype: float64
In [351]:
nat_data['h_pct90'] = nat_data_filtered['h_pct90']

a_pct10¶

In [352]:
with pd.option_context('display.max_colwidth', None):
    print(notes_df)
                                                                                         Notes
34                                        *  = indicates that a wage estimate is not available
35                                **  = indicates that an employment estimate is not available
36       #  = indicates a wage equal to or greater than $115.00 per hour or $239,200 per year 
37  ~ =indicates that the percent of establishments reporting the occupation is less than 0.5%
In [353]:
nat_data_filtered = wrangle_a_h_col(nat_data, an_hr_col_drop, 'a_pct10', 239200)
str val counts:

a_pct10
*    7
Name: count, dtype: int64
78
Resulting shape:

(0, 16)
0
Check: each of the targeted cells is NaN:

[nan]
Describe:

count      1063.000000
mean      43009.416745
std       16659.119891
min       18500.000000
25%       32265.000000
50%       38030.000000
75%       48300.000000
max      189720.000000
Name: a_pct10, dtype: float64
Check NaN:

7
In [354]:
nat_data_filtered['a_pct10'].describe()
Out[354]:
count      1063.000000
mean      43009.416745
std       16659.119891
min       18500.000000
25%       32265.000000
50%       38030.000000
75%       48300.000000
max      189720.000000
Name: a_pct10, dtype: float64
In [355]:
nat_data['a_pct10'] = nat_data_filtered['a_pct10']

a_pct25¶

In [356]:
nat_data_filtered = wrangle_a_h_col(nat_data, an_hr_col_drop, 'a_pct25', 239200)
str val counts:

a_pct25
*    7
#    1
Name: count, dtype: int64
78
Resulting shape:

(1, 16)
1
Check: each of the targeted cells is NaN:

[nan]
Describe:

count      1063.000000
mean      54261.674506
std       27360.720901
min       24480.000000
25%       37535.000000
50%       46950.000000
75%       62075.000000
max      239200.000000
Name: a_pct25, dtype: float64
Check NaN:

7
In [357]:
nat_data_filtered['a_pct25'].describe()
Out[357]:
count      1063.000000
mean      54261.674506
std       27360.720901
min       24480.000000
25%       37535.000000
50%       46950.000000
75%       62075.000000
max      239200.000000
Name: a_pct25, dtype: float64
In [358]:
nat_data['a_pct25'] = nat_data_filtered['a_pct25']

a_median¶

In [359]:
nat_data_filtered = wrangle_a_h_col(nat_data, an_hr_col_drop, 'a_median', 239200)
str val counts:

a_median
#    19
*     7
Name: count, dtype: int64
78
Resulting shape:

(19, 16)
19
Check: each of the targeted cells is NaN:

[nan]
Describe:

count      1063.000000
mean      69326.698024
std       37891.227934
min       30160.000000
25%       45850.000000
50%       59280.000000
75%       80125.000000
max      239200.000000
Name: a_median, dtype: float64
Check NaN:

7
In [360]:
nat_data_filtered['a_median'].describe()
Out[360]:
count      1063.000000
mean      69326.698024
std       37891.227934
min       30160.000000
25%       45850.000000
50%       59280.000000
75%       80125.000000
max      239200.000000
Name: a_median, dtype: float64
In [361]:
nat_data['a_median'] = nat_data_filtered['a_median']

a_pct75¶

In [362]:
nat_data_filtered = wrangle_a_h_col(nat_data, an_hr_col_drop, 'a_pct75', 239200)
str val counts:

a_pct75
#    28
*     7
Name: count, dtype: int64
78
Resulting shape:

(28, 16)
28
Check: each of the targeted cells is NaN:

[nan]
Describe:

count      1063.000000
mean      87751.072437
std       45028.967280
min       35360.000000
25%       56150.000000
50%       76820.000000
75%      104390.000000
max      239200.000000
Name: a_pct75, dtype: float64
Check NaN:

7
In [363]:
nat_data_filtered['a_pct75'].describe()
Out[363]:
count      1063.000000
mean      87751.072437
std       45028.967280
min       35360.000000
25%       56150.000000
50%       76820.000000
75%      104390.000000
max      239200.000000
Name: a_pct75, dtype: float64
In [364]:
nat_data['a_pct75'] = nat_data_filtered['a_pct75']

a_pct90¶

In [365]:
nat_data_filtered = wrangle_a_h_col(nat_data, an_hr_col_drop, 'a_pct90', 239200)
str val counts:

a_pct90
#    60
*     7
Name: count, dtype: int64
78
Resulting shape:

(60, 16)
60
Check: each of the targeted cells is NaN:

[nan]
Describe:

count      1063.000000
mean     109360.479774
std       53278.632026
min       35970.000000
25%       67490.000000
50%       95750.000000
75%      137800.000000
max      239200.000000
Name: a_pct90, dtype: float64
Check NaN:

7
In [366]:
nat_data_filtered['a_pct90'].describe()
Out[366]:
count      1063.000000
mean     109360.479774
std       53278.632026
min       35970.000000
25%       67490.000000
50%       95750.000000
75%      137800.000000
max      239200.000000
Name: a_pct90, dtype: float64
In [367]:
nat_data['a_pct90'] = nat_data_filtered['a_pct90']

annual¶

Analysis¶

The annual column only has, comparatively, a few values.

In [368]:
nat_data_filtered = nat_data.drop(columns=an_hr_col_drop, axis=1)
In [369]:
nat_data_filtered['annual'].unique()
Out[369]:
array([nan, True], dtype=object)
In [370]:
nat_data_filtered['annual'].value_counts()
Out[370]:
annual
True    78
Name: count, dtype: int64
In [371]:
nat_data_filtered['annual'].isna().sum()
Out[371]:
np.int64(992)
In [372]:
with pd.option_context('display.max_colwidth', None):
    print(nat_desc[nat_desc['Field'] == 'annual']['Field Description'])
30    Contains "TRUE" if only annual wages are released. The OEWS program releases only annual wages for some occupations that typically work fewer than 2,080 hours per year, but are paid on an annual basis, such as teachers, pilots, and athletes.
Name: Field Description, dtype: object

Analysis¶

This column appears to be complete and not in need of wrangling at this time.

hourly¶

This column should be similar in nature to annual.

In [373]:
nat_data_filtered = nat_data.drop(columns=an_hr_col_drop, axis=1)
In [374]:
nat_data_filtered['hourly'].unique()
Out[374]:
array([nan, True], dtype=object)
In [375]:
nat_data_filtered['hourly'].value_counts()
Out[375]:
hourly
True    7
Name: count, dtype: int64
In [376]:
nat_data_filtered['hourly'].isna().sum()
Out[376]:
np.int64(1063)
In [377]:
with pd.option_context('display.max_colwidth', None):
    print(nat_desc[nat_desc['Field'] == 'hourly']['Field Description'])
31    Contains "TRUE" if only hourly wages are released. The OEWS program releases only hourly wages for some occupations that typically work fewer than 2,080 hours per year and are paid on an hourly basis, such as actors, dancers, and musicians and singers.
Name: Field Description, dtype: object

Update To Do Items¶

  • Analyze types and particulars of strings
  • Replace or drop rows across all columns as necessary:
    • In all hourly data rows, insert 115 for all #
    • In all annual data rows, insert 239,200 for all #
    • In all annual and hourly rows, insert null for all *
  • Analyze and decide on course of action for hourly and annual columns
  • Research issue where 10 rows have both annual and hourly data
    • This issue will likely need research from outside the dataset

Post Wrangle Analysis¶

We have now made observations and completed wrangling on each column in the nat_data dataset.

Each column should be prepared for insertion into the train_reward_compare postgresql database.

Hourly-related rows that had a # value were substituted with 115. This allows the column to be consistent, but needs to be managed during evaluation.

Annual-related rows that had a # value were substituted with 239200, and expectations should be similar.

Note on Other BLS Datasets¶

We have several datasets, and we have only focused on the National dataset.

To limit the amount of time we spend at this stage, we're deciding to go ahead and push this prepared dataframe to SQL. After we push, we will start a new notebook and shift our focus to wrangling the O*NET dataset.

In theory, time permitting, we can return and develop the other BLS datasets that are listed at the beginning of this notebook.

PostgreSQL Database Sync¶

Before inserting the dataframe into PostgreSQL, we need to first ensure that the schema and table are ready.

In [388]:
bls_schema_name = 'bls'
nat_data_table_name = 'nat_data'
In [390]:
# Check if the bls schema exists. If not, create it.
# Adapted from a Google search, 'sqlalchemy how to create a schema'
with engine.connect() as conn:
    if not inspect(conn).has_schema(bls_schema_name):
        conn.execute(CreateSchema(bls_schema_name))
        conn.commit()
    else:
        print(f'{bls_schema_name} already exists')
bls already exists
In [403]:
# Check if a table with the indicated name exists
# If it does, check that it has a matching column and data type structure
# If it does not, send error and pause script
# As needed, create table in its entirety
# Adapted from the response of a google search, 'slqalchemy how to create a schema'
# Adapted from google search, 'sqlalchemy check if table exists within a schema'
# Adapted from Google search, 'sqlalchemy how to create column with big serial primary key'
    
# Create an inspector
inspector = inspect(engine)

# Check for the table in a specific schema
# If schema is None, it checks the default schema (e.g., 'public' in Postgres)
table_exists = inspector.has_table(nat_data_table_name, schema=bls_schema_name)

print(f"{nat_data_table_name} table exists: {table_exists}")

if not table_exists:
    class Base(DeclarativeBase):
        pass

    class User(Base):
        __tablename__ = nat_data_table_name
        # To place this table in a specific schema:
        __table_args__ = {"schema": schema_name} 
        
        id: Mapped[int] = mapped_column(BigInteger, primary_key=True)
        area = Column(Integer)
        area_title = Column(String)
        area_type = Column(Integer)
        prim_state = Column(String)
        naics = Column(Integer)
        naics_title = Column(String)
        i_group = Column(String)
        own_code = Column(Integer)
        occ_code = Column(String)
        occ_title = Column(String)
        o_group = Column(String)
        tot_emp = Column(Integer)
        emp_prse = Column(Float)
        h_mean = Column(Float)
        a_mean = Column(Float)
        mean_prse = Column(Float)
        h_pct10 = Column(Float)
        h_pct25 = Column(Float)
        h_median = Column(Float)
        h_pct75 = Column(Float)
        h_pct90 = Column(Float)
        a_pct10 = Column(Float)
        a_pct25 = Column(Float)
        a_median = Column(Float)
        a_pct75 = Column(Float)
        a_pct90 = Column(Float)
        annual = Column(Boolean)
        hourly = Column(Boolean)
        
    
    # Create all tables defined in 'Base'
    Base.metadata.create_all(engine)
nat_data table exists: True

Insert nat_data Data¶

Only run the following cell once.

In [407]:
# Only run once
# If a need arises to reimport, drop the database using the PgAdmin4 interface and then reimport

# Insert each row from nat_data into the psql table
# nat_data.to_sql(
#    name=nat_data_table_name, 
#    con=engine, 
#    schema=bls_schema_name,
#    if_exists='append',      # options: 'fail', 'replace', 'append'
#    index=False
#)

Verify Data Received¶

In [408]:
with engine.connect() as conn:
    result = conn.execute(text("select count(*) from bls.nat_data"))
    print(result.fetchone())
(1070,)

Conclusion¶

With a brief visual inspect in PgAdmin4, we can see that the nat_data database properly imported.