Jason Greenberg

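Research question: Starting with their undergraduate careers, do women face barriers to entry or strong opposition to entering STEM-related fields, specifically science and engineering, due to their gender? In addition, does region within the United States affect this gender disparity?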

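Changing gender expectations and society's increasingly equal treatment of women have prompted researchers from different disciplines to analyze what causes the earnings gap between men and women to persist. Many factors affect a worker's wages, including job industry, experience, inherent ability, lifestyle preferences, and performance. Many of these predictors are subjective and difficult to measure. This presentation will not focus on explaining why median male earnings are higher, but will instead examine gender differences in the choice of college major, which is an indicator of eventual career trajectory and earnings. The undergraduate major data from the 2015 American Community Survey show a systematic gender preference for and against science and engineering majors. Moreover, while there are striking gaps between the numbers of men and women with science and engineering majors both nationwide and at the state level, the relative ratios of male to female science and engineering degree holders indicate that different regions of the United States face different degrees of gender disparity in the sciences at the undergraduate level.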

In [1]:
library(ggplot2)
library(maps)
library(RColorBrewer)
Warning message:
"package 'maps' was built under R version 3.3.3"

The packages above are required to produce the graphics that support the argument of this paper.

In [2]:
bachelors <- read.csv("bachelors.csv", header = TRUE, stringsAsFactors = FALSE)
 dim(bachelors)
 head(bachelors)
  1. 581
  2. 291
GEO.id GEO.id2 GEO.display.label HC01_EST_VC01 HC01_MOE_VC01 HC02_EST_VC01 HC02_MOE_VC01 HC03_EST_VC01 HC03_MOE_VC01 HC04_EST_VC01 ... HC02_EST_VC27 HC02_MOE_VC27 HC03_EST_VC27 HC03_MOE_VC27 HC04_EST_VC27 HC04_MOE_VC27 HC05_EST_VC27 HC05_MOE_VC27 HC06_EST_VC27 HC06_MOE_VC27
Id Id2 Geography Total; Estimate; Total population 25 years and over with a Bachelor's degree or higher Total; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher Percent; Estimate; Total population 25 years and over with a Bachelor's degree or higher Percent; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher Males; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher Percent Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher ... Percent; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others Percent Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others
0400000US01 1 Alabama 792876 14677 (X) (X) 366201 9531 (X) ... 17.8 1.6 14284 1696 16.4 1.8 14435 1926 19.4 2.4
0400000US02 2 Alaska 139416 4807 (X) (X) 67843 3222 (X) ... 18.9 3.6 1951 565 17.1 4.4 2090 580 20.9 5.2
0400000US04 4 Arizona 1257449 16239 (X) (X) 621477 10626 (X) ... 18.4 1 26996 2115 15.6 1.1 30518 2468 21.9 1.6
0400000US05 5 Arkansas 433381 7690 (X) (X) 198339 5631 (X) ... 15.8 1.7 6777 1168 14 2.4 7330 1299 17.9 2.6
0400000US06 6 California 8415690 37555 (X) (X) 4123037 23131 (X) ... 23.6 0.5 157982 5589 18.9 0.6 211503 6420 29.1 0.9

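The "bachelors.csv" file contains information on male and female bachelor's degree holders in 2015 for various geographic areas of the United States and Puerto Rico. Because this is the dataset used in Problem Set 2 of this course, no major data cleaning is needed. For the purposes of this presentation, only the aggregate state-level data are used, rather than the urban, rural, or city-specific data.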

In [3]:
desired_columns <- c(3, 4, 16, 28, 40, 52, 64)
desired_rows <- seq(2,53) # all 50 states, Washington DC, and Puerto Rico
subsetTotal <- bachelors[desired_rows, desired_columns] 
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
dim(subsetTotal)
subsetTotal
  1. 52
  2. 7
   State                  Total  SciEng SciEngRelated Business Education HumArts
 2 Alabama               792876  232948         79424   182750    135842  161912
 3 Alaska                139416   50587         12854    22495     20385   33095
 4 Arizona              1257449  416610        119456   271892    183272  266219
 5 Arkansas              433381  126510         42437    95297     83480   85657
 6 California           8415690 3427565        674278  1586921    563581 2163345
 7 Colorado             1440776  553496        121176   286139    147138  332827
 8 Connecticut           948044  342674         80873   187022    100734  236741
 9 Delaware              201929   70332         21046    44771     28364   37416
10 District of Columbia  268345  125167         11869    33555     11676   86078
11 Florida              4092338 1283693        414601   995869    589792  808383
12 Georgia              2000113  641664        178239   485769    273021  421420
13 Hawaii                309194  110773         27619    61107     41200   68495
14 Idaho                 276912   92786         30241    50346     45560   57979
15 Illinois             2853540  932408        269244   619650    384915  647323
16 Indiana              1088120  305153        134069   226250    185666  236982
17 Iowa                  556591  169312         54940   116686    103090  112563
18 Kansas                599063  175589         64756   129649    105871  123198
19 Kentucky              696174  203689         78840   137156    119107  157382
20 Louisiana             718058  200267         88487   144280    121677  163347
21 Maine                 289553  102732         28277    39879     45189   73476
22 Maryland             1591614  645663        138044   295580    157419  354908
23 Massachusetts        1951689  780836        158139   364241    175035  473438
24 Michigan             1870473  609561        194451   397446    277764  391251
25 Minnesota            1284007  429808        123550   252884    185440  292325
26 Mississippi           406599  100307         55009    84900     87213   79170
27 Missouri             1140860  342021        113391   246469    184485  254494
28 Montana               216174   71556         23757    35787     39252   45822
29 Nebraska              372288  103636         40424    84020     71497   72711
30 Nevada                463681  147490         42672   105846     61003  106670
31 New Hampshire         334313  123948         31664    63342     42482   72877
32 New Jersey           2318073  854811        195024   527062    259221  481955
33 New Mexico            364462  127327         32978    57214     58421   88522
34 New York             4778463 1623429        415315   895311    548835 1295573
35 North Carolina       1991057  687074        182823   402340    269287  449533
36 North Dakota          143403   40867         20868    27786     27259   26623
37 Ohio                 2115116  650627        233599   448383    342930  439577
38 Oklahoma              630004  182916         61146   143828    122057  120057
39 Oregon                901667  347808         79655   131021    100430  242753
40 Pennsylvania         2641023  874686        276665   530252    396210  563210
41 Rhode Island          238818   80596         22368    45589     29037   61228
42 South Carolina        890241  283678         85236   197700    131685  191942
43 South Dakota          154885   47184         15976    29897     33840   27988
44 Tennessee            1151080  342842        117853   256625    171712  262048
45 Texas                4955374 1719782        445895  1164460    632224  993013
46 Utah                  554712  180989         55741   104661     77687  135634
47 Vermont               162072   60132         12850    18942     22690   47458
48 Virginia             2102044  863355        159477   397325    201524  480363
49 Washington           1670893  685505        146243   274943    166411  397791
50 West Virginia         254414   74797         32048    44434     51203   51932
51 Wisconsin            1112458  340019        125112   225032    182504  239791
52 Wyoming               102034   35634         10228    14729     21140   20303
53 Puerto Rico           590228  146944         63591   190620    112350   76723

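This data frame shows the total number of bachelor's degrees held in 2015 by people aged 25 and over in all 50 states, Washington DC, and Puerto Rico. Five categories of major are included. The US Census Bureau's American Community Survey defines science and engineering related majors as including nursing, architecture, and mathematics teacher education degrees, while the science and engineering category includes biology, chemistry, physics, mathematics, computer science, and social science degrees.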

In [5]:
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
bandNames <- colnames(subsetTotal[,3:7])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(subsetTotal[,j+2])/as.numeric(subsetTotal[,2]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

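Before analyzing gender and regional bias in the choice of college major, it helps to see the distribution of majors for the combined male and female figures across all observations. The visual above consists of five histograms, one for each category of college major. The x-axis shows the percentage of degree holders with that major, ranging from 0 to 70% as defined by the "breaks = seq(0, 0.7, by=0.05)" code, while the y-axis shows frequency as a number of states, which ranges from 0 to 50. Each bar represents a bin range of 5%. The grouped set of histograms is generated with "par" and a for loop, and the individual histogram titles are tied to the column names of the original data frame "subsetTotal" through the "colnames" function. For these national, non-region-specific data on all degree holders over age 25, science and engineering majors make up the highest share of total degrees. This is reflected in the higher median level, with over 20 states at roughly 30% of all degrees in the state, and in the relatively balanced, bell-curve shaped distribution. Meanwhile, the lower median for science and engineering related fields, where 30 states have 5% to 10% of degree holders with this type of degree, and for education major degrees, where about 20 states have 10% to 15% of degree holders with this type of degree, signifies lower popularity.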
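As a rough illustration of what each histogram counts, the sketch below recomputes the science and engineering share for the first five states printed in the table above and assigns each share to a 5% bin using the same breaks. The names `demo`, `share`, and `bin` are made up for this example; only the counts come from the printed output.

```r
# Counts copied from the first five rows of subsetTotal printed above.
demo <- data.frame(
  State  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California"),
  Total  = c(792876, 139416, 1257449, 433381, 8415690),
  SciEng = c(232948, 50587, 416610, 126510, 3427565),
  stringsAsFactors = FALSE
)

# Share of all bachelor's degrees that are science and engineering degrees.
demo$share <- demo$SciEng / demo$Total

# Same 5% bins as the histograms: [0, 0.05], (0.05, 0.10], ..., (0.65, 0.70].
demo$bin <- cut(demo$share, breaks = seq(0, 0.7, by = 0.05), include.lowest = TRUE)

print(demo[, c("State", "share", "bin")])

# Number of these five states that fall in each occupied bin.
print(table(droplevels(demo$bin)))
```

Each bar height in the histograms is this kind of count, taken over all 52 rows rather than five.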

In [6]:
Desired_columnsMale <- c(8, 20, 32, 44, 56, 68) #men totals
Desired_rowsMale <- seq(2,53) #all states
SubsetMale <- bachelors[Desired_rowsMale, Desired_columnsMale] 
colnames(SubsetMale) <- c("TotalMale", "SciEngMale", "SciEngRelatedMale", "BusinessMale", 
                          "EducationMale", "HumArtsMale")
bandNames <- colnames(SubsetMale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetMale[,j+1])/as.numeric(SubsetMale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

Desired_columnsFemale <- c(12, 24, 36, 48, 60, 72) #women totals
Desired_rowsFemale <- seq(2,53) #all states
SubsetFemale <- bachelors[Desired_rowsFemale, Desired_columnsFemale] 
colnames(SubsetFemale) <- c("TotalFemale", "SciEngFemale", "SciEngRelatedFemale", "BusinessFemale", 
                            "EducationFemale", "HumArtsFemale")
bandNames <- colnames(SubsetFemale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetFemale[,j+1])/as.numeric(SubsetFemale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

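These two sets of five histograms, split by gender, help show how men and women differ in how often they choose each major. The same coding techniques and structure were used as in the original non-gendered histogram set above. This time, the new subsets "SubsetMale" and "SubsetFemale" are used in place of "subsetTotal", with the data drawn from the gender-specific columns of the original data file. One of the most notable differences between the distributions is in the science and engineering histograms. The center of the distribution for men with science and engineering degrees is about 20 percentage points higher than the center of the distribution for women. No other type of college major shows such a gender disparity. Further statistical analysis will help clarify certain aspects of the relationship between the number of men and the number of women with science and engineering degrees.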

In [128]:
desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

subset$percentMaleTotal <- as.numeric(subset$TotalMale)/as.numeric(subset$Total)*100
subset$percentFemaleTotal <- as.numeric(subset$TotalFemale)/as.numeric(subset$Total)*100

subset$percentOfMaleInSciEng <- as.numeric(subset$SciEngMale)/as.numeric(subset$SciEngTotal)*100     # this pair of percentages sums to 100%
subset$percentOfFemaleInSciEng <- as.numeric(subset$SciEngFemale)/as.numeric(subset$SciEngTotal)*100

subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100           # whereas this pair has no such direct relationship
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

subset$percentSciEngRelatedMale <- as.numeric(subset$SciEngRelatedMale)/as.numeric(subset$TotalMale)*100          
subset$percentSciEngRelatedFemale <- as.numeric(subset$SciEngRelatedFemale)/as.numeric(subset$TotalFemale)*100

subset$percentBusinessMale <- as.numeric(subset$BusinessMale)/as.numeric(subset$TotalMale)*100
subset$percentBusinessFemale <- as.numeric(subset$BusinessFemale)/as.numeric(subset$TotalFemale)*100

subset$percentEducationMale <- as.numeric(subset$EducationMale)/as.numeric(subset$TotalMale)*100
subset$percentEducationFemale <- as.numeric(subset$EducationFemale)/as.numeric(subset$TotalFemale)*100

subset$percentHumanitiesMale <- as.numeric(subset$HumanitiesMale)/as.numeric(subset$TotalMale)*100
subset$percentHumanitiesFemale <- as.numeric(subset$HumanitiesFemale)/as.numeric(subset$TotalFemale)*100

subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale

head(subset)
State Total TotalMale TotalFemale SciEngTotal SciEngMale SciEngFemale SciEngRelatedTotal SciEngRelatedMale SciEngRelatedFemale ... percentSciEngFemale percentSciEngRelatedMale percentSciEngRelatedFemale percentBusinessMale percentBusinessFemale percentEducationMale percentEducationFemale percentHumanitiesMale percentHumanitiesFemale SciEngRatio
2 Alabama 792876 366201 426675 232948 146493 86455 79424 20420 59004 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
3 Alaska 139416 67843 71573 50587 29875 20712 12854 3428 9426 ... 28.93829 5.052843 13.16977 18.57819 13.81946 9.125481 19.83150 23.20799 24.24098 0.6571582
4 Arizona 1257449 621477 635972 416610 262739 153871 119456 34583 84873 ... 24.19462 5.564647 13.34540 25.39499 17.93601 7.067840 21.91087 19.69598 22.61310 0.5722941
5 Arkansas 433381 198339 235042 126510 78847 47663 42437 9224 33213 ... 20.27850 4.650623 14.13067 27.88357 17.01526 8.265142 28.54256 19.44701 20.03302 0.5101041
6 California 8415690 4123037 4292653 3427565 2040615 1386950 674278 209270 465008 ... 32.30986 5.075628 10.83265 20.86023 16.93233 3.194441 10.06075 21.37669 29.86442 0.6528166
7 Colorado 1440776 705075 735701 553496 331019 222477 121176 36832 84344 ... 30.24014 5.223841 11.46444 23.05031 16.80261 5.171365 15.04361 19.60642 26.44920 0.6441191

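By dividing the number of male and female degree holders in each major by the total number of male and female degree holders, the percentage of degree holders for each state, major, and gender can be derived. Having both the raw counts and these relative figures is important for a more complete analysis. To clarify the comments in the second and third pairs of calculations above, "percentOfMaleInSciEng" and "percentOfFemaleInSciEng" give the percentage of science and engineering degree holders who are men and who are women, relative to the sum of degree holders of both genders. That is why these two values sum to 100%. Meanwhile, "percentSciEngMale" and "percentSciEngFemale" give the share of men's and of women's degrees that are science and engineering degrees, relative to their other degrees, not the other gender, which is why these two percentages most likely will not add up to 100%. Both calculations are important for understanding the relationship among men, women, and degree choice.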
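The difference between the two pairs of percentages can be checked by hand for a single state. The sketch below uses the Alabama counts from the printed output; the variable names are chosen for this example only.

```r
# Alabama counts copied from the head(subset) output above.
sci_eng_male   <- 146493
sci_eng_female <- 86455
sci_eng_total  <- 232948
total_male     <- 366201
total_female   <- 426675

# Pair 1: how science and engineering degrees are split between the genders.
pct_of_male_in_scieng   <- sci_eng_male   / sci_eng_total * 100
pct_of_female_in_scieng <- sci_eng_female / sci_eng_total * 100

# Pair 2: share of each gender's own degrees that are science and engineering.
pct_scieng_male   <- sci_eng_male   / total_male   * 100
pct_scieng_female <- sci_eng_female / total_female * 100

# The first pair is complementary by construction; the second pair is not.
print(pct_of_male_in_scieng + pct_of_female_in_scieng)
print(pct_scieng_male + pct_scieng_female)
```

The first pair shares a denominator, so it must sum to 100. The second pair uses a different denominator for each gender, so its sum carries no particular meaning.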

In [8]:
median(as.numeric(subset$TotalMale))
median(as.numeric(subset$TotalFemale))

median(as.numeric(subset$SciEngMale))
median(as.numeric(subset$SciEngFemale))
342900.5
412566.5
134348.5
84527

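By calculating the median across states of the total number of male and female degree holders, it can be seen that women hold more degrees in the typical state. The ratio of female to male degree holders is about 1.2 to 1, while the ratio of female to male science and engineering degree holders is about 1.0 to 1.59. This shows that even though women hold more degrees overall, men typically hold more science and engineering degrees than women.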

In [9]:
sum(as.numeric(subset$TotalMale))
sum(as.numeric(subset$TotalFemale))

sum(as.numeric(subset$SciEngMale))
sum(as.numeric(subset$SciEngFemale))
31870667
34961114
13925554
9244229

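Calculating the same ratios from the sums of all state counts, the overall female to male ratio is about 1.1 to 1, while the corresponding science and engineering ratio is about 1.0 to 1.5. Therefore, for both the national median ratios and the sum ratios, women hold more degrees in general, but the relative difference between male and female science and engineering degrees is larger and runs in the opposite direction.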
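The ratios quoted for the medians and the sums follow directly from the printed outputs. A minimal sketch of that arithmetic, with a helper function `gap_ratios` that is hypothetical and exists only for this example:

```r
# Values copied from the median() and sum() outputs above.
median_counts <- c(male = 342900.5, female = 412566.5,
                   sci_male = 134348.5, sci_female = 84527)
sum_counts    <- c(male = 31870667, female = 34961114,
                   sci_male = 13925554, sci_female = 9244229)

# Hypothetical helper: female-to-male ratio for all degrees, and
# male-to-female ratio for science and engineering degrees.
gap_ratios <- function(x) {
  c(female_per_male_all    = unname(x["female"] / x["male"]),
    male_per_female_scieng = unname(x["sci_male"] / x["sci_female"]))
}

print(round(gap_ratios(median_counts), 2))
print(round(gap_ratios(sum_counts), 2))
```

Both versions point the same way: the all-degree ratio favors women slightly, while the science and engineering ratio favors men by a wider margin.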

In [10]:
median(subset$percentMaleTotal)
median(subset$percentFemaleTotal)

median(subset$percentOfMaleInSciEng )
median(subset$percentOfFemaleInSciEng)

median(subset$percentSciEngMale)
median(subset$percentSciEngFemale)
median(subset$percentSciEngMale)/median(subset$percentSciEngFemale)

median(subset$percentSciEngRelatedMale)/median(subset$percentSciEngRelatedFemale) 

median(subset$percentBusinessMale)/median(subset$percentBusinessFemale) 

median(subset$percentEducationMale)/median(subset$percentEducationFemale) 

median(subset$percentHumanitiesMale)/median(subset$percentHumanitiesFemale)
47.5113516955763
52.4886483044237
60.6014516113314
39.3985483886686
42.3009988693081
24.1244107783942
1.75345210533361
0.413230746804793
1.45666628645974
0.344464288170358
0.858303537506162

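From the percentage calculations, women held about 52.5% of all bachelor's degrees among people over 25 in 2015. Yet they only held about 39.4% of science and engineering degrees. Consistent with the ratios just examined, men are more inclined than women to enter science and engineering. In the median state, 42.3% of all degrees held by men are science and engineering degrees, while only 24.1% of degrees held by women are. The ratio of these percentages is 1.75, which is the largest male to female ratio of any college major category, with business majors second at 1.46. The differences in total degree counts, state medians, and median state percentages all indicate that women not entering science or engineering is a societal trend. A 2009 paper analyzing a survey of 161 Northwestern University students determined, through an econometric model based on the survey results, that the most important reasons women decide against a particular major relate to how much they expect to enjoy the coursework. The author argues that differences between men's and women's expectations of coursework in particular departments may be related to gender discrimination in society (Zafar 29). A 1984 article used data from the National Longitudinal Study of the High School Class of 1972, and its authors argue that substantial differences had emerged by the senior year of high school in preferences for various types of work and, subsequently, in preparation for the labor market during college (Daymont 414). Again, an economic regression model based on survey answers determined which factors weigh most heavily among the main contributors to the gender gap. If students' preferences and impressions about career choices affect the choice of college major, and thereby career path and earnings, then the data from the 2015 American Community Survey further support the idea that the gender gap begins to form even before students finish their education.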

In [12]:
options(scipen=2000000) # very large scipen penalty so values print in fixed rather than scientific notation
summary(lm(as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale)))
Call:
lm(formula = as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale))

Residuals:
   Min     1Q Median     3Q    Max 
-78668  -9063   -421   6422  97610 

Coefficients:
                                  Estimate   Std. Error t value
(Intercept)                   -3391.997658  4208.916304  -0.806
as.numeric(subset$SciEngMale)     0.676498     0.009741  69.447
                                         Pr(>|t|)    
(Intercept)                                 0.424    
as.numeric(subset$SciEngMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23820 on 50 degrees of freedom
Multiple R-squared:  0.9897,	Adjusted R-squared:  0.9895 
F-statistic:  4823 on 1 and 50 DF,  p-value: < 0.00000000000000022

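The code above outputs a linear regression summary in which the raw count of science and engineering degrees held by women is regressed on the raw count of science and engineering degrees held by men for each state, DC, and Puerto Rico. The same summary output is then produced for the other four major types. The key summary statistics are then analyzed further for evidence of how women systematically avoid majoring in science or engineering.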

In [13]:
summary(lm(as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale)))
Call:
lm(formula = as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale))

Residuals:
   Min     1Q Median     3Q    Max 
-30447  -5203  -2189   6499  34268 

Coefficients:
                                       Estimate Std. Error t value
(Intercept)                          7139.04527 2119.41889   3.368
as.numeric(subset$SciEngRelatedMale)    2.32544    0.04122  56.413
                                                 Pr(>|t|)    
(Intercept)                                       0.00146 ** 
as.numeric(subset$SciEngRelatedMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11470 on 50 degrees of freedom
Multiple R-squared:  0.9845,	Adjusted R-squared:  0.9842 
F-statistic:  3182 on 1 and 50 DF,  p-value: < 0.00000000000000022
In [14]:
summary(lm(as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale)))
Call:
lm(formula = as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale))

Residuals:
   Min     1Q Median     3Q    Max 
-42580  -3227    412   2940  56153 

Coefficients:
                                   Estimate  Std. Error t value
(Intercept)                     -1341.27859  2532.33558   -0.53
as.numeric(subset$BusinessMale)     0.80450     0.01122   71.68
                                           Pr(>|t|)    
(Intercept)                                   0.599    
as.numeric(subset$BusinessMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13810 on 50 degrees of freedom
Multiple R-squared:  0.9904,	Adjusted R-squared:  0.9902 
F-statistic:  5138 on 1 and 50 DF,  p-value: < 0.00000000000000022
In [15]:
summary(lm(as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale)))
Call:
lm(formula = as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale))

Residuals:
   Min     1Q Median     3Q    Max 
-62856  -8567  -1395   7101  82946 

Coefficients:
                                    Estimate  Std. Error t value
(Intercept)                      -4432.65497  4627.01232  -0.958
as.numeric(subset$EducationMale)     3.35557     0.08909  37.666
                                            Pr(>|t|)    
(Intercept)                                    0.343    
as.numeric(subset$EducationMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22270 on 50 degrees of freedom
Multiple R-squared:  0.966,	Adjusted R-squared:  0.9653 
F-statistic:  1419 on 1 and 50 DF,  p-value: < 0.00000000000000022
In [16]:
summary(lm(as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale)))
Call:
lm(formula = as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale))

Residuals:
   Min     1Q Median     3Q    Max 
-60060  -6516   2465   6738  59227 

Coefficients:
                                     Estimate  Std. Error t value
(Intercept)                       -8813.81804  2759.08699  -3.194
as.numeric(subset$HumanitiesMale)     1.39733     0.01402  99.692
                                              Pr(>|t|)    
(Intercept)                                    0.00243 ** 
as.numeric(subset$HumanitiesMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15380 on 50 degrees of freedom
Multiple R-squared:  0.995,	Adjusted R-squared:  0.9949 
F-statistic:  9938 on 1 and 50 DF,  p-value: < 0.00000000000000022

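The dependence of the number of female degree holders on the number of male degree holders is positive in all five majors. Looking at the raw data on bachelor's degree holders, this result is not surprising. The positive relationship means that a state with more degree holders of one gender also has relatively more degree holders of the other gender. However, the smallest slope belongs to the science and engineering relationship, at 0.68, while the other slopes are 0.80, 1.40, 2.33, and 3.36. Because the p-value of each slope is negligibly close to zero, the slope coefficients are all statistically significant.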
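To make the slope comparison concrete, the sketch below collects the five slopes from the printed summaries and evaluates the science and engineering line at the median male count. The object names are made up for this example, and the calculation is only an illustration of how to read the coefficients, not an additional result.

```r
# Slopes for all five majors and the SciEng intercept, copied from the
# lm() summaries above.
slopes <- c(SciEng = 0.676498, SciEngRelated = 2.32544, Business = 0.80450,
            Education = 3.35557, Humanities = 1.39733)
scieng_intercept <- -3391.997658

# Majors ordered from the smallest to the largest slope.
print(sort(slopes))

# Fitted number of female SciEng degree holders for a state with the
# median number of male SciEng degree holders (134348.5, from In [8]).
median_male_scieng <- 134348.5
predicted_female <- scieng_intercept + slopes[["SciEng"]] * median_male_scieng
print(predicted_female)
```

A slope below 1 means each additional male degree holder is associated with fewer than one additional female degree holder. Science and engineering and business are the only two categories where that holds.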

In [17]:
ggplot(subset, aes(as.numeric(SciEngMale), as.numeric(SciEngFemale)))  +
      geom_point()+
      #scale_x_continuous(name="Total Male Science Degree Holders", limits=c(0,150000)) +
      #scale_y_continuous(name="Total Female Science Degree Holders", limits=c(0,150000))+
      labs(x = "Total Male Science Degree Holders by State") +
      labs(y = "Total Female Science Degree Holders by State")+ ylim(0,2000000)+
      labs(title = "Relationship Between Male and Female Science Degree Holders") + 
      stat_smooth(method = lm, se = FALSE, color = "black") +
      geom_vline(xintercept = 134349, linetype="dotted", colour="red")+   # median male count
      geom_hline(yintercept =  84527, linetype="dotted", colour="red")+   # median female count
      geom_vline(xintercept = 0)+
      geom_hline(yintercept = 0)+
      annotate("text", label = "r^2 == 0.9895", parse = TRUE,x= 1400000, y = 1500000) +
      annotate("text", label = "slope = 0.676498", x= 1475000, y = 1250000)



#qplot(as.numeric(SciEngMale), as.numeric(SciEngFemale), data = subset, color = I("darkblue"),
# xlab = "Total Males in Science", ylab = "Total Females in Science", 
# main = "Relationship Between Total Males and Females in Science") + geom_smooth(method = "lm", se = FALSE)
#qplot version of the ggplot above

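The ggplot above visually represents the first of the five linear regressions run earlier. The x-axis represents the total number of male bachelor's degree holders aged 25 and over in science or engineering for each of the 50 states, DC, and Puerto Rico. The y-axis represents the same data for women. The median number of men with a science or engineering degree is 134,349, while the median for women is 84,527. The dotted crosshair lines mark this point. Also, the r-squared value indicates that almost 99% of the variation in the number of female science and engineering degree holders is explained by variation in the number of male science or engineering degree holders. This value is remarkably high for a regression with only one independent variable, but the r-squared values of the regressions for the other college major categories are similarly high. Setting aside the possibility of distorted data, this suggests that the number of male degree holders in a particular major in a state in 2015 is a very precise predictor of the number of female degree holders in that major.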
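For a regression with a single predictor, r-squared is the square of the Pearson correlation between the two variables. The sketch below checks this on only the six states printed by head(subset), so its values will not reproduce the 52-row result of 0.9897; the variable names are chosen for this example.

```r
# Male and female SciEng counts for the six states printed by head(subset):
# Alabama, Alaska, Arizona, Arkansas, California, Colorado.
male   <- c(146493, 29875, 262739, 78847, 2040615, 331019)
female <- c(86455, 20712, 153871, 47663, 1386950, 222477)

fit <- lm(female ~ male)

# With one predictor, R-squared equals the squared Pearson correlation.
r_squared   <- summary(fit)$r.squared
correlation <- cor(male, female)
print(c(r_squared = r_squared, cor_squared = correlation^2))
```

Because raw counts scale with state size, large states such as California sit far from the rest on both axes, which is one plausible reason the fit on counts is so tight.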

In [18]:
summary(lm(subset$percentSciEngFemale ~ subset$percentSciEngMale))
ggplot(subset, aes(percentSciEngMale, percentSciEngFemale))  +
      geom_point()+
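      labs(x = "Percent of Men with Science Degrees by State") + xlim(32.5,52.5)+ 
      labs(y = "Percent of Women with Science Degrees by State")+ ylim(15.25,35.25)+
      labs(title = "Relationship Between Percent of Men and Women with Science Degrees") + 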
      stat_smooth(method = lm, se = FALSE, color = "black")
Call:
lm(formula = subset$percentSciEngFemale ~ subset$percentSciEngMale)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9425 -1.5519  0.5676  1.1299  7.3309 

Coefficients:
                          Estimate Std. Error t value            Pr(>|t|)    
(Intercept)              -17.19788    3.77537  -4.555 0.00003383556251666 ***
subset$percentSciEngMale   0.99056    0.08823  11.227 0.00000000000000284 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.589 on 50 degrees of freedom
Multiple R-squared:  0.716,	Adjusted R-squared:  0.7103 
F-statistic: 126.1 on 1 and 50 DF,  p-value: 0.000000000000002836
Warning message:
"Removed 1 rows containing non-finite values (stat_smooth)."
Warning message:
"Removed 1 rows containing missing values (geom_point)."

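Looking at the percentage of degree holders in each state who majored in science and engineering, as opposed to the raw counts, one sees a nearly one-to-one slope, calculated in the regression above at about 0.99. At first glance this relationship may seem to conflict with the interpretation of the raw counts, but because of the difference in the observed values, this slope actually further supports the idea that markedly fewer women than men go into science and engineering. For each one percentage point increase in the share of men holding science and engineering bachelor's degrees, the share of women holding science and engineering degrees is also expected to rise by about one percentage point. However, the range of percentage values for women, as seen by the y-axis, is lower throughout by about 18 percentage points, which matches the difference between the median state percentages of 42.3% for men and 24.1% for women. Importantly, this relationship implies that states with higher percentages of both men and women holding science degrees have more similar rates than states with lower percentages, since with a roughly constant gap, higher numerator and denominator values give a fraction closer to 1. This will be made more visible later by taking the ratio of these two variables and mapping the state values onto a choropleth.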
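The claim that a near-constant gap implies a ratio closer to 1 at higher rates can be checked with the fitted line itself. The sketch below plugs a few hypothetical male rates into the printed coefficients; the chosen rates and the variable names are assumptions for illustration only.

```r
# Coefficients copied from the percent regression summary above.
intercept <- -17.19788
slope     <- 0.99056

# Hypothetical male rates (percent of men's degrees that are SciEng).
male_rate <- c(35, 40, 45, 50)

# Fitted female rate and the implied female-to-male ratio at each male rate.
female_rate <- intercept + slope * male_rate
ratio <- female_rate / male_rate

print(data.frame(male_rate, female_rate, gap = male_rate - female_rate, ratio))
```

The gap stays near 17.5 points across the range, while the ratio rises from about 0.50 at a male rate of 35% to about 0.65 at 50%.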

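Initializing the state latitude and longitude data and creating choropleth maps will give a better understanding of the regional implications of these findings.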

In [19]:
states <- map_data("state")
head(states)
dim(states)
head(subset)
long lat group order region subregion
-87.46201 30.38968 1 1 alabama NA
-87.48493 30.37249 1 2 alabama NA
-87.52503 30.37249 1 3 alabama NA
-87.53076 30.33239 1 4 alabama NA
-87.57087 30.32665 1 5 alabama NA
-87.58806 30.32665 1 6 alabama NA
  1. 15537
  2. 6
State Total TotalMale TotalFemale SciEngTotal SciEngMale SciEngFemale SciEngRelatedTotal SciEngRelatedMale SciEngRelatedFemale ... percentSciEngFemale percentSciEngRelatedMale percentSciEngRelatedFemale percentBusinessMale percentBusinessFemale percentEducationMale percentEducationFemale percentHumanitiesMale percentHumanitiesFemale SciEngRatio
2 Alabama 792876 366201 426675 232948 146493 86455 79424 20420 59004 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
3 Alaska 139416 67843 71573 50587 29875 20712 12854 3428 9426 ... 28.93829 5.052843 13.16977 18.57819 13.81946 9.125481 19.83150 23.20799 24.24098 0.6571582
4 Arizona 1257449 621477 635972 416610 262739 153871 119456 34583 84873 ... 24.19462 5.564647 13.34540 25.39499 17.93601 7.067840 21.91087 19.69598 22.61310 0.5722941
5 Arkansas 433381 198339 235042 126510 78847 47663 42437 9224 33213 ... 20.27850 4.650623 14.13067 27.88357 17.01526 8.265142 28.54256 19.44701 20.03302 0.5101041
6 California 8415690 4123037 4292653 3427565 2040615 1386950 674278 209270 465008 ... 32.30986 5.075628 10.83265 20.86023 16.93233 3.194441 10.06075 21.37669 29.86442 0.6528166
7 Colorado 1440776 705075 735701 553496 331019 222477 121176 36832 84344 ... 30.24014 5.223841 11.46444 23.05031 16.80261 5.171365 15.04361 19.60642 26.44920 0.6441191
In [20]:
names(subset) <- tolower(names(subset))
subset$region <- tolower(subset$state)
head(subset)
state total totalmale totalfemale sciengtotal sciengmale sciengfemale sciengrelatedtotal sciengrelatedmale sciengrelatedfemale ... percentsciengrelatedmale percentsciengrelatedfemale percentbusinessmale percentbusinessfemale percenteducationmale percenteducationfemale percenthumanitiesmale percenthumanitiesfemale sciengratio region
2 Alabama 792876 366201 426675 232948 146493 86455 79424 20420 59004 ... 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188 alabama
3 Alaska 139416 67843 71573 50587 29875 20712 12854 3428 9426 ... 5.052843 13.16977 18.57819 13.81946 9.125481 19.83150 23.20799 24.24098 0.6571582 alaska
4 Arizona 1257449 621477 635972 416610 262739 153871 119456 34583 84873 ... 5.564647 13.34540 25.39499 17.93601 7.067840 21.91087 19.69598 22.61310 0.5722941 arizona
5 Arkansas 433381 198339 235042 126510 78847 47663 42437 9224 33213 ... 4.650623 14.13067 27.88357 17.01526 8.265142 28.54256 19.44701 20.03302 0.5101041 arkansas
6 California 8415690 4123037 4292653 3427565 2040615 1386950 674278 209270 465008 ... 5.075628 10.83265 20.86023 16.93233 3.194441 10.06075 21.37669 29.86442 0.6528166 california
7 Colorado 1440776 705075 735701 553496 331019 222477 121176 36832 84344 ... 5.223841 11.46444 23.05031 16.80261 5.171365 15.04361 19.60642 26.44920 0.6441191 colorado

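The code above modifies the original "subset" data by renaming the columns with lowercase headers and adding a final column named "region", which holds the lowercase state name so that each entry shares a key with the corresponding state in the "states" data.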

In [21]:
choro_df <- merge(states, subset, by = "region") # merge(df1, df2, by = "name of the shared column")
head(choro_df)
region long lat group order subregion state total totalmale totalfemale ... percentsciengfemale percentsciengrelatedmale percentsciengrelatedfemale percentbusinessmale percentbusinessfemale percenteducationmale percenteducationfemale percenthumanitiesmale percenthumanitiesfemale sciengratio
alabama -87.46201 30.38968 1 1 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.48493 30.37249 1 2 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.52503 30.37249 1 3 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.53076 30.33239 1 4 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.57087 30.32665 1 5 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.58806 30.32665 1 6 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188

Once the two data frames have a matching column, they can be combined with the "merge" command and then sorted by the order column.
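The merge-then-sort step can be tried on a miniature version of the two data frames. In the sketch below, `states_small` and `ratios` are hypothetical stand-ins; the Alabama coordinates and the three ratios are copied from the output above, while the Arizona coordinates are invented placeholders.

```r
# Miniature stand-in for the map data, deliberately out of drawing order.
# Alabama points come from head(states); Arizona points are hypothetical.
states_small <- data.frame(
  long   = c(-87.48493, -114.7, -87.46201, -114.6, -87.52503),
  lat    = c(30.37249, 35.1, 30.38968, 35.0, 30.37249),
  order  = c(2, 5, 1, 4, 3),
  region = c("alabama", "arizona", "alabama", "arizona", "alabama"),
  stringsAsFactors = FALSE
)

# Miniature stand-in for the degree data, one row per state.
ratios <- data.frame(
  State       = c("Arizona", "Alabama", "Alaska"),
  sciengratio = c(0.5722941, 0.5065188, 0.6571582),
  stringsAsFactors = FALSE
)

# Build the shared lowercase key, merge on it, then restore drawing order.
ratios$region <- tolower(ratios$State)
merged <- merge(states_small, ratios, by = "region")
merged <- merged[order(merged$order), ]
print(merged)
```

Each boundary point picks up the values of its state, which is why the merged output repeats the same state row many times. By default merge() keeps only keys present in both data frames, so a state with no matching region in the map data (Alaska in this sketch) drops out of the result.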

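The following two maps, titled "Female Degree Holders with a Science or Engineering Major" and "Male Degree Holders with a Science or Engineering Major," display the results of the raw count regression analysis performed earlier. States with more men holding science and engineering degrees also have more women holding science and engineering degrees. The legend beside each map indicates the relative gap between the genders in the number of degrees. In a regional context, because there is a strong direct relationship between the male and female science and engineering degree counts, the maps look very similar.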

In [22]:
choro <- choro_df[order(choro_df$order),]  #order by "order" column
head(choro)
region long lat group order subregion state total totalmale totalfemale ... percentsciengfemale percentsciengrelatedmale percentsciengrelatedfemale percentbusinessmale percentbusinessfemale percenteducationmale percenteducationfemale percenthumanitiesmale percenthumanitiesfemale sciengratio
alabama -87.46201 30.38968 1 1 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.48493 30.37249 1 2 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.52503 30.37249 1 3 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.53076 30.33239 1 4 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.57087 30.32665 1 5 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
alabama -87.58806 30.32665 1 6 NA Alabama 792876 366201 426675 ... 20.26249 5.576173 13.82879 27.62663 19.12017 7.468576 25.42732 19.32518 21.36122 0.5065188
In [23]:
choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1400000, by = 100000), include.lowest = TRUE, 
                    labels = c("0-100,000","100,001-200,000","200,001-300,000","300,001-400,000","400,001-500,000",
                              "500,001-600,000","600,001-700,000","700,001-800,000","800,001-900,000","900,001-1,000,000",
                              "1,000,001-1,100,000","1,100,001-1,200,000","1,200,001-1,300,000","1,300,001-1,400,000"))
                              
#choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1500000, by = 250000), include.lowest = TRUE, 
# labels = c("0-250,000","250,000-500,000","250,000-500,000", "250,000-500,000",
#                              "500,000-750,000","750,000-1,000,000","1,000,000-1,250,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Female Degree Holders with a Science or Engineering Major") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Reds")

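Above is the map of raw degree counts for women. Break separation units of 100,000 were used, while break separation units of 200,000 were used for the men. Even with the different intervals, the trend across all states of men holding roughly 1.5 times as many science degrees as women can be seen in the similarity of the colors.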
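The effect of the two bin widths can be seen by binning a few states by hand. The sketch below applies the same breaks to three states from the printed output and returns bin numbers rather than labels; the object names are made up for this example.

```r
# SciEng counts for three states, copied from the head(subset) output.
state         <- c("Alabama", "California", "Colorado")
female_counts <- c(86455, 1386950, 222477)
male_counts   <- c(146493, 2040615, 331019)

# labels = FALSE makes cut() return the bin number instead of a text label.
female_bin <- cut(female_counts, breaks = seq(0, 1400000, by = 100000),
                  include.lowest = TRUE, labels = FALSE)
male_bin   <- cut(male_counts, breaks = seq(0, 2200000, by = 200000),
                  include.lowest = TRUE, labels = FALSE)

print(data.frame(state, female_bin, male_bin))
```

California lands in the top bin of both scales (14 of 14 for women, 11 of 11 for men) and Alabama in the bottom bin of both, so the two maps shade these states alike even though the underlying counts differ.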

In [129]:
choro$breaks <- cut(as.numeric(choro$sciengmale),breaks = seq(0,2200000, by = 200000), include.lowest = TRUE, 
                    labels = c("0-200,000","200,001-400,000","400,001-600,000","600,001-800,000","800,001-1,000,000",
                              "1,000,001-1,200,000","1,200,001-1,400,000","1,400,001-1,600,000","1,600,001-1,800,000","1,800,001-2,000,000",
                              "2,000,001-2,200,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Male Degree Holders with a Science or Engineering Major") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Blues")

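Although the raw counts suggest that there is no regional difference between men and women in the number of science degrees, looking at the percentage of science or engineering degrees in each state leads to a different conclusion. Comparing the two, there are regions of the country where men are more represented in the sciences relative to women, and vice versa. Some states do have high rates for both, like California and New York, but in other states such as Wyoming and Florida, the rate for men is high while the rate for women is relatively low. Regional differences aside, these maps do further demonstrate that more men than women go into the sciences. The rates at which women hold these degrees are much lower, as shown earlier by the median percentages of men and women with science and engineering degrees.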

In [25]:
choro$breaks <- cut(choro$percentsciengfemale,breaks = seq(15,45, by = 5), include.lowest = TRUE, 
                    labels = c("15%-20%","20%-25%","25%-30%",
                              "30%-35%","35%-40%","40%-45%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percent of Female Degree Holders with a Science or Engineering Major") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Reds")
In [26]:
choro$breaks <- cut(choro$percentsciengmale,breaks = seq(30,55, by = 5), include.lowest = TRUE, 
                    labels = c("30%-35%","35%-40%","40%-45%",
                              "45%-50%","50%-55%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percent of Male Degree Holders with a Science or Engineering Major") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Blues")

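By taking the ratio of the percentage of women with science and engineering degrees (relative to their other degrees) to the corresponding percentage of men, we can see from just one, rather than two, maps how region affects the relative rates at which men and women obtain science and engineering degrees. Darker states indicate a higher female to male ratio, though the ratio never reaches 1. The general trend is that the East and West coasts have higher ratios than the rest of the country. Unlike the first two maps, which showed very similar regional patterns, this ratio map clearly shows the varying degrees of gender bias in college major across different regions of the country.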
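The ratio being mapped can be recomputed from the raw counts. The sketch below does so for the first five states in the printed output; the data frame `counts` and its column names are made up for this example.

```r
# Counts for the first five states, copied from the head(subset) output.
counts <- data.frame(
  state        = c("Alabama", "Alaska", "Arizona", "Arkansas", "California"),
  total_male   = c(366201, 67843, 621477, 198339, 4123037),
  total_female = c(426675, 71573, 635972, 235042, 4292653),
  sci_male     = c(146493, 29875, 262739, 78847, 2040615),
  sci_female   = c(86455, 20712, 153871, 47663, 1386950),
  stringsAsFactors = FALSE
)

# Rate for each gender, then the female-to-male ratio of those rates.
# A ratio of 1 would mean both genders choose SciEng at the same rate.
counts$rate_male   <- counts$sci_male   / counts$total_male
counts$rate_female <- counts$sci_female / counts$total_female
counts$ratio       <- counts$rate_female / counts$rate_male

# States ordered from the most to the least balanced.
print(counts[order(-counts$ratio), c("state", "ratio")])
```

These values agree with the SciEngRatio column in the printed output, and among these five states the ratio runs from roughly 0.51 to 0.66.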

In [27]:
choro$breaks <- cut(choro$percentsciengfemale/choro$percentsciengmale,breaks = seq(0.35,0.85,by = 0.05), include.lowest = TRUE, 
                    labels = c("0.35-0.40","0.40-0.45","0.45-0.50","0.50-0.55",
                              "0.55-0.60","0.60-0.65","0.65-0.70","0.70-0.75","0.75-0.80","0.80-0.85"))
#choro$breaks = cut(choro$percentsciengfemale/choro$percentsciengmale, 6)
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Female-to-Male Ratio of Science and Engineering Degree Percentages") + 
    scale_fill_brewer(name = "Ratio of Percents",
                       palette = "Purples")
In [69]:
desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)  # re-create the original "subset" data frame, unaltered by the choropleth merge
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100           # percent of each sex's degrees that are in science and engineering
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale   

head(subset)
   State         Total  TotalMale  TotalFemale  SciEngTotal  SciEngMale  SciEngFemale  SciEngRelatedTotal  SciEngRelatedMale  SciEngRelatedFemale  ...  BusinessFemale  EducationTotal  EducationMale  EducationFemale  HumanitiesTotal  HumanitiesMale  HumanitiesFemale  percentSciEngMale  percentSciEngFemale  SciEngRatio
2  Alabama      792876     366201       426675       232948      146493         86455               79424              20420                59004  ...           81581          135842          27350           108492           161912           70769             91143           40.00344             20.26249    0.5065188
3  Alaska       139416      67843        71573        50587       29875         20712               12854               3428                 9426  ...            9891           20385           6191            14194            33095           15745             17350           44.03549             28.93829    0.6571582
4  Arizona     1257449     621477       635972       416610      262739        153871              119456              34583                84873  ...          114068          183272          43925           139347           266219          122406            143813           42.27654             24.19462    0.5722941
5  Arkansas     433381     198339       235042       126510       78847         47663               42437               9224                33213  ...           39993           83480          16393            67087            85657           38571             47086           39.75365             20.27850    0.5101041
6  California  8415690    4123037      4292653      3427565     2040615       1386950              674278             209270               465008  ...          726846          563581         131708           431873          2163345          881369           1281976           49.49301             32.30986    0.6528166
7  Colorado    1440776     705075       735701       553496      331019        222477              121176              36832                84344  ...          123617          147138          36462           110676           332827          138240            194587           46.94806             30.24014    0.6441191
In [74]:
sort(as.numeric(subset$SciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngMale == "2040615"), "State"]
subset[which(subset$SciEngMale == "1074765"), "State"]
subset[which(subset$SciEngMale == "912146"), "State"]

sort(as.numeric(subset$SciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngFemale == "1386950"), "State"]
subset[which(subset$SciEngFemale == "711283"), "State"]
subset[which(subset$SciEngFemale == "645017"), "State"]
  1. 2040615
  2. 1074765
  3. 912146
  4. 797404
  5. 561926
  6. 528311
  7. 505464
  8. 497992
  9. 440726
  10. 419946
'California'
'Texas'
'New York'
  1. 1386950
  2. 711283
  3. 645017
  4. 486289
  5. 370482
  6. 357891
  7. 356819
  8. 346375
  9. 340110
  10. 295373
'California'
'New York'
'Texas'

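By sorting the states with the largest numbers of science degrees and then using the `which` function, we can identify which states those counts belong to. California has the most science and engineering degree holders among both men and women. Texas and New York follow on both measures, with Texas second among men and New York second among women.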
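
The same lookup can be sketched without copying counts out of the sorted output by hand. The example below is a hedged illustration rather than the notebook's own method: `toy` is a small stand-in for `subset` built from counts printed above, and `top_states` is a hypothetical helper name.

```r
# Toy stand-in for the notebook's `subset` data frame; counts are stored as
# text, as they are in `subset` (the lookups above compare them to strings)
toy <- data.frame(
  State        = c("Alabama", "California", "New York", "Texas"),
  SciEngMale   = c("146493", "2040615", "912146", "1074765"),
  SciEngFemale = c("86455", "1386950", "711283", "645017"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the n states with the largest value in `column`
top_states <- function(df, column, n = 3) {
  values <- as.numeric(df[[column]])               # convert text counts to numbers
  ranked <- order(values, decreasing = TRUE)       # row positions, largest first
  df$State[ranked][seq_len(min(n, nrow(df)))]
}

print(top_states(toy, "SciEngMale"))
print(top_states(toy, "SciEngFemale"))
```

Indexing the state names by `order` returns the names directly, so no count needs to be retyped as a string.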

In [75]:
sort(as.numeric(subset$percentSciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngMale == "51.9724320009824"), "State"]
subset[which(subset$percentSciEngMale == "50.63696225976"), "State"]
subset[which(subset$percentSciEngMale == "49.9569814248343"), "State"]

sort(as.numeric(subset$percentSciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngFemale == "41.6149338278437"), "State"]
subset[which(subset$percentSciEngFemale == "33.1419517414361"), "State"]
subset[which(subset$percentSciEngFemale == "32.9437484466618"), "State"]
  1. 51.9724320009824
  2. 50.63696225976
  3. 49.9569814248343
  4. 49.7663620413637
  5. 49.4930072177378
  6. 47.6220113737173
  7. 47.3557411283058
  8. 47.1133741667282
  9. 46.9480551714357
  10. 46.0928226636069
'District of Columbia'
'Washington'
'Maryland'
  1. 41.6149338278437
  2. 33.1419517414361
  3. 32.9437484466618
  4. 32.3098559329161
  5. 32.3026458001363
  6. 31.5553384998919
  7. 30.3424863564093
  8. 30.2401383170609
  9. 29.9560656325652
  10. 29.6273939162456
'District of Columbia'
'Massachusetts'
'Virginia'

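The District of Columbia has the highest percentage of both men and women holding science and engineering degrees. Washington and Maryland have high concentrations among men, while Massachusetts and Virginia have high percentages among women.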
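
Looking up a computed percentage by comparing it to a quoted string only works when every printed digit is copied exactly. The sketch below illustrates that caveat and two alternatives, using counts for Alabama and Alaska from the `head(subset)` output; the variable names are assumptions for illustration.

```r
# Percent of Alabama's male degree holders in science and engineering,
# computed from the counts shown in the head(subset) output
pct <- 146493 / 366201 * 100

# Matching against the 7-digit value that head() prints fails, because
# `==` converts pct to text with up to 15 significant digits
printed_match <- pct == "40.00344"

# A tolerance-based numeric comparison is one safer alternative
tolerant_match <- isTRUE(all.equal(pct, 40.00344, tolerance = 1e-6))

# Ranking sidesteps value matching entirely
pcts <- c(Alabama = pct, Alaska = 29875 / 67843 * 100)
highest <- names(pcts)[which.max(pcts)]

print(c(printed_match = printed_match, tolerant_match = tolerant_match))
print(highest)
```

The lookups in the cell above succeed because the strings carry all 15 digits that R printed; `which.max` or `order` would give the same states without depending on that.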

In [127]:
subset[which(subset$percentSciEngMale > 46), "State"]
subset[which(as.numeric(subset$SciEngMale) > 410000), "State"]

order(subset$percentSciEngMale)[42:52]
order(as.numeric(subset$SciEngMale))[42:52]

subset[which(subset$percentSciEngFemale > 29), "State"]
subset[which(as.numeric(subset$SciEngFemale) > 290000), "State"]

order(subset$percentSciEngFemale)[42:52]
order(as.numeric(subset$SciEngFemale))[42:52]
order(subset$SciEngRatio)[42:52]
  1. 'California'
  2. 'Colorado'
  3. 'District of Columbia'
  4. 'Maryland'
  5. 'Massachusetts'
  6. 'New Hampshire'
  7. 'Oregon'
  8. 'Virginia'
  9. 'Washington'
  10. 'Wyoming'
  1. 'California'
  2. 'Florida'
  3. 'Illinois'
  4. 'Massachusetts'
  5. 'New Jersey'
  6. 'New York'
  7. 'Pennsylvania'
  8. 'Texas'
  9. 'Virginia'
  10. 'Washington'
  1. 32
  2. 51
  3. 6
  4. 30
  5. 38
  6. 22
  7. 5
  8. 47
  9. 21
  10. 48
  11. 9
  1. 36
  2. 48
  3. 22
  4. 31
  5. 47
  6. 39
  7. 14
  8. 10
  9. 33
  10. 44
  11. 5
  1. 'California'
  2. 'Colorado'
  3. 'District of Columbia'
  4. 'Maryland'
  5. 'Massachusetts'
  6. 'New Jersey'
  7. 'Oregon'
  8. 'Vermont'
  9. 'Virginia'
  10. 'Washington'
  1. 'California'
  2. 'Florida'
  3. 'Illinois'
  4. 'Massachusetts'
  5. 'New Jersey'
  6. 'New York'
  7. 'North Carolina'
  8. 'Pennsylvania'
  9. 'Texas'
  10. 'Virginia'
  1. 2
  2. 31
  3. 46
  4. 6
  5. 38
  6. 48
  7. 21
  8. 5
  9. 47
  10. 22
  11. 9
  1. 21
  2. 34
  3. 22
  4. 39
  5. 31
  6. 47
  7. 14
  8. 10
  9. 44
  10. 33
  11. 5
  1. 6
  2. 21
  3. 5
  4. 46
  5. 2
  6. 47
  7. 31
  8. 8
  9. 33
  10. 22
  11. 9

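Looking at state names and their row positions through the `which` and `order` functions shows that the states with the highest concentrations of men and women in science and engineering do not necessarily match the states with the most degrees. Likewise, the states with the highest female-to-male ratios do not necessarily match the states with the highest female percentages. These figures suggest that although the national ratio of male to female science and engineering degrees appears relatively stable at about 3:2, the ratio of men to women going into the sciences is far from uniform at the state level. The degree of gender disparity in the sciences therefore depends on the region of the country. Note that the `which` command lists the states in alphabetical order, whereas the `order` command lists row positions in ascending order of the chosen quantity, which is why positions 42 through 52 are used to index the largest values.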
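
Because `order` returns row positions rather than names, the numeric output above has to be mapped back to states. A minimal sketch of that mapping, using the six `SciEngRatio` values from the `head(subset)` output as a toy data frame (`toy`, `idx`, and `top3` are illustrative names):

```r
# Toy stand-in for two columns of `subset` (values taken from head(subset))
toy <- data.frame(
  State = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  SciEngRatio = c(0.5065188, 0.6571582, 0.5722941, 0.5101041, 0.6528166, 0.6441191),
  stringsAsFactors = FALSE
)

# order() returns row positions sorted by ascending value, so the last
# positions point at the largest ratios
idx <- order(toy$SciEngRatio)

# Names of the three states with the highest ratio, largest last
top3 <- toy$State[tail(idx, 3)]
print(top3)
```

In the notebook's output the positions count rows of `subset`, not the printed row labels, which start at 2. Position 5 is therefore California and position 9 is the District of Columbia.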

In [30]:
summary(lm(log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal))))
Call:
lm(formula = log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal)))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35757 -0.07210 -0.01002  0.07399  0.34999 

Coefficients:
                                    Estimate Std. Error t value    Pr(>|t|)    
(Intercept)                         -1.02118    0.18148  -5.627 0.000000827 ***
log(as.numeric(subset$SciEngTotal))  0.03825    0.01454   2.630      0.0113 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1118 on 50 degrees of freedom
Multiple R-squared:  0.1215,	Adjusted R-squared:  0.104 
F-statistic: 6.918 on 1 and 50 DF,  p-value: 0.01131

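The regression above shows that the ratio map does not have a strong relationship with the total number of science and engineering degrees in each state. This was an area of concern because the map of science and engineering degree counts and the ratio map look similar. If the ratio map merely showed that states with more science and engineering degrees have a higher proportion of women in the sciences relative to men, it might only indicate that more degree holders means a more equitable ratio. Taking the log of both variables (the dependent variable, the gender ratio in science and engineering majors, and the independent variable, the total number of degrees per state) lets the regression estimate the effect of a percent change in total degrees on the percent change in the ratio. Otherwise, the difference in units would produce a meaningless statistic. A slope of just 0.03825 indicates that there is almost no relationship between the number of degrees in a state and the degree of gender bias in the ratio of women to men majoring in science and engineering. A 1% increase in the number of degrees in a state is associated with only about a 0.038% change in the ratio.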
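
To make the size of that slope concrete, the sketch below converts the log-log coefficient into implied percent changes. Only the value 0.03825 comes from the regression output; `ratio_change_pct` is a hypothetical helper introduced for illustration.

```r
# Slope from the regression output above (log ratio on log total degrees)
slope <- 0.03825

# Hypothetical helper: percent change in the ratio implied by a given
# percent change in total degrees under a log-log model
ratio_change_pct <- function(degree_change_pct, b = slope) {
  ((1 + degree_change_pct / 100)^b - 1) * 100
}

one_pct  <- ratio_change_pct(1)     # effect of 1% more degrees
doubling <- ratio_change_pct(100)   # effect of doubling the degree count
print(round(c(one_pct = one_pct, doubling = doubling), 3))
```

Under this model even a doubling of a state's degree count corresponds to a change of under 3% in the female-to-male ratio, which is consistent with reading the slope as a weak relationship.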

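There is not much literature that analyzes region, college major choice, and gender in the same context. However, given that preferences are a significant factor in women not choosing majors such as engineering, the next step in building on the results of this presentation is to bring regional effects into the narrative, on the argument that certain areas of the country create environments that give women a more positive impression of the sciences. In his Federal Reserve Bank of New York staff report, Basit Zafar finds through the previously mentioned econometric models that 60% of the gender gap in engineering is due to differences in preferences, while 30% is due to differences in how much women and men believe they would enjoy studying engineering (Zafar 4). There are other explanations for why students choose one major over another, including ones that emphasize the link between major choice and political ideology. A 2006 paper found that liberal students are more likely to choose non-science majors (Porter 2006). This explanation seems to run counter to the finding that the typically more liberal coastal regions of the United States have high percentages of science degrees among total degrees for both men and women. However, the survey used in Porter and Umbach's paper sampled only one highly selective liberal arts college, and the authors acknowledge that the results cannot be extrapolated to a larger sample of students from different types of schools. Similarly, the 2015 data examined here are a single snapshot of the relationship between gender, major, and region. Further analysis over time should be considered to determine the changing landscape of gender imbalance in science and engineering major choice.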

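The complexity of gender gap analysis goes beyond the limitations of the data. The chosen scope of an examination inherently changes the range of possible explanations. In a study that examines not only gender but also socioeconomic status (SES), Ma finds that women from lower-SES backgrounds are as likely as men to choose lucrative college majors, and that the role of a lucrative major in potentially raising the socioeconomic status of students and their families outweighs the traditional gender-role socialization that steers men and women toward different career paths (Ma 228). In a paper on citizenship status, the author finds a higher propensity to enroll in SEM [science, engineering, and mathematics] fields among the foreign-born population, along with a lower propensity to enroll in the social sciences compared with citizens (Nores 138). To fully break down every possible influence on gender differences in college major choice, all possible variables would have to be included in the analysis.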

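Although the maps and ratios computed here indicate varying degrees of gender bias in college major choice across different regions of the country, establishing the reasons behind the geographic pattern is beyond the scope of this report. However, if government policies, educational backgrounds, or cultural differences are tied to geography, then this analysis may be a starting point for determining why women across the country systematically choose, to varying degrees, not to enter science or engineering during their undergraduate careers. In addition, college students do not necessarily come from the state where they study. State-level differences may therefore reflect differences in the quality of academic institutions in a given state rather than any difference in gender equality. Better schools may have more resources for science and engineering studies. Based on data from the U.S. Census Bureau's American Community Survey, it is clear that there is some reason why women are not entering science and engineering at the same rate as men, and that there is at least an indirect relationship between region and the degree of gender disparity.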

Bibliography

Daymont, Thomas N., and Paul J. Andrisani. "Job Preferences, College Major, and the Gender Gap in Earnings." The Journal of Human Resources 19, no. 3 (1984): 408-28. doi:10.2307/145880.

Ma, Yingyi. "Family Socioeconomic Status, Parental Involvement, and College Major Choices: Gender, Race/Ethnic, and Nativity Patterns." Sociological Perspectives 52, no. 2 (2009): 211-34. doi:10.1525/sop.2009.52.2.211.

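Nores, Milagros. "Differences in College Major Choice by Citizenship Status." The Annals of the American Academy of Political and Social Science 627 (2010): 125-41. http://www.jstor.org/stable/40607409.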

Porter, Stephen R., and Paul D. Umbach. "College Major Choice: An Analysis of Person-Environment Fit." Research in Higher Education 47, no. 4 (2006): 429-49. doi:10.1007/s11162-005-9002-3.

United States Census Bureau. (2015). American Community Survey [bachelors.csv]. Retrieved from http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Zafar, Basit. "College Major Choice and the Gender Gap." SSRN Electronic Journal (2013): 1-50. doi:10.2139/ssrn.1348219.