Jason Greenberg

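Research question: Beginning with their undergraduate careers, do women face barriers to, or strong discouragement from, entering STEM-related fields, particularly science and engineering, due to their gender? Additionally, does region within the United States affect this gender disparity?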

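Changing gender expectations and society's increasingly equal treatment of women have prompted researchers from a range of disciplines to analyze why the earnings gap between men and women persists. Many factors influence a worker's wages, including job industry, experience, inherent ability, lifestyle preferences, and performance. Many of these predictors are subjective and difficult to measure. This presentation will not focus on explaining why median earnings are higher for men; instead, it examines gender differences in college major choice, which is an indicator of eventual career trajectory and earnings. Bachelor's degree major data from the 2015 American Community Survey show a clear, systematic gender difference in preference for science and engineering majors. Moreover, beyond the nationwide pattern, there is an unusual gap at the state level between the numbers of men and women with science and engineering majors: the relative proportion of male to female science and engineering degree holders indicates that different regions of the United States face different degrees of gender disparity in the sciences at the undergraduate level.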

library(ggplot2)
library(maps)
library(RColorBrewer)

Warning message:
"package 'maps' was built under R version 3.3.3"

The packages above are required to run the graphics that support this paper's argument.

bachelors <- read.csv("bachelors.csv", header = TRUE, stringsAsFactors = FALSE)
 dim(bachelors)
 head(bachelors)

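The "bachelors.csv" file contains information on male and female bachelor's degree holders for various geographic areas of the United States and Puerto Rico in 2015. Because this is the data set used in Problem Set 2 of this course, no major data cleaning is required. For the purposes of this presentation, only aggregate state-level data are used, rather than urban, rural, or city-specific data.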

desired_columns <- c(3, 4, 16, 28, 40, 52, 64)
desired_rows <- seq(2,53) # all states, Washington DC, and Puerto Rico
subsetTotal <- bachelors[desired_rows, desired_columns] 
colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
dim(subsetTotal)
subsetTotal

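This data frame shows the total number of bachelor's degrees held in 2015 by people aged 25 and over in all 50 states, Washington DC, and Puerto Rico. Five categories of major are included. The U.S. Census Bureau's American Community Survey defines science and engineering related majors to include nursing, architecture, and mathematics teacher education degrees, while the science and engineering category includes biology, chemistry, physics, mathematics, computer science, and social science degrees.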

colnames(subsetTotal) <- c("State","Total", "SciEng", "SciEngRelated", "Business", "Education", "HumArts")
bandNames <- colnames(subsetTotal[,3:7])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(subsetTotal[,j+2])/as.numeric(subsetTotal[,2]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

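Before analyzing gender and regional bias in men's and women's college major choices, it helps to see the distribution of majors across all observations using the combined figures for men and women. The visual above consists of five histograms, one for each category of college major. The x-axis represents the percentage of degree holders with that major, ranging from 0 to 70% as defined by the "breaks = seq(0,0.7, by=0.05)" code, while the y-axis represents frequency as a number of states, which ranges from 0 to 50. Each bar represents a bin 5 percentage points wide. The grouped set of histograms is generated using "par" and a for loop, and the individual histogram titles are tied to the column names of the original "subsetTotal" data frame through the "colnames" function. In these national data on degree holders aged 25 and over, not broken out by region, science and engineering majors account for the highest share of total degrees. This is reflected in the higher median level, with over 20 states at roughly 30% of all degrees in the state, and in a relatively balanced, bell-curve shaped distribution. Meanwhile, the lower medians for science and engineering related fields, where 30 states have 5% to 10% of degree holders with this type of degree, and for education major degrees, where roughly 20 states have 10% to 15% of such degrees, signify lower popularity.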
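As a rough illustration of how the bins behind these histograms are formed, the sketch below recomputes the science and engineering share for the first six states listed in "subsetTotal" and counts them per 5-point bin. It assumes only base R; the six-state vectors are copied by hand from the printed table, and the names "total", "sci_eng", "share", and "bin_table" are introduced here purely for illustration.

```r
# First six rows of subsetTotal (Alabama through Colorado), copied by hand.
total   <- c(792876, 139416, 1257449, 433381, 8415690, 1440776)
sci_eng <- c(232948, 50587, 416610, 126510, 3427565, 553496)

# Share of all bachelor's degrees that are science and engineering degrees.
share <- sci_eng / total

# Same bin edges as the histograms above; plot = FALSE returns the counts only.
bins <- hist(share, breaks = seq(0, 0.7, by = 0.05), plot = FALSE)

# One row per bin, labeled by its lower and upper edge.
bin_table <- data.frame(
  lower  = head(bins$breaks, -1),
  upper  = tail(bins$breaks, -1),
  states = bins$counts
)

# Show only the bins that contain at least one of the six states.
print(bin_table[bin_table$states > 0, ])
```

With all 52 rows in place of the six used here, the "states" column would correspond to the bar heights of the first histogram.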

Desired_columnsMale <- c(8, 20, 32, 44, 56, 68) #men totals
Desired_rowsMale <- seq(2,53) #all states
SubsetMale <- bachelors[Desired_rowsMale, Desired_columnsMale] 
colnames(SubsetMale) <- c("TotalMale", "SciEngMale", "SciEngRelatedMale", "BusinessMale", 
                          "EducationMale", "HumArtsMale")
bandNames <- colnames(SubsetMale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetMale[,j+1])/as.numeric(SubsetMale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

Desired_columnsFemale <- c(12, 24, 36, 48, 60, 72) #women totals
Desired_rowsFemale <- seq(2,53) #all states
SubsetFemale <- bachelors[Desired_rowsFemale, Desired_columnsFemale] 
colnames(SubsetFemale) <- c("TotalFemale", "SciEngFemale", "SciEngRelatedFemale", "BusinessFemale", 
                            "EducationFemale", "HumArtsFemale")
bandNames <- colnames(SubsetFemale[,-1])
par(mfrow = c(3,2))
par(mar = c(0,0,0,0))
 for(j in 1:5){ 
    hist(as.numeric(SubsetFemale[,j+1])/as.numeric(SubsetFemale[,1]),breaks = seq(0,0.7, by=0.05),ylim = c(0,50),
        axes = FALSE, main = "", xlab = "", ylab = "", col = "grey")
    box()
    text(x = .33, y=40, label = bandNames[j])
 }

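These two sets of five histograms, split by gender, help show the differences between men and women in the frequency of major choice. The same coding technique and structure were used as in the original non-gendered set of histograms above. This time, the new subsets "SubsetMale" and "SubsetFemale" are used in place of "subsetTotal", with the data drawn from the gender-specific columns of the original data file. One of the most notable differences in distribution appears in the histograms for science and engineering majors. The center of the distribution for men in science and engineering is about 20 percentage points higher than that for women. No other type of college major shows such a gender disparity. Further statistical analysis will help clarify aspects of the relationship between the number of men with science and engineering degrees and the number of women with science and engineering degrees.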

desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

subset$percentMaleTotal <- as.numeric(subset$TotalMale)/as.numeric(subset$Total)*100
subset$percentFemaleTotal <- as.numeric(subset$TotalFemale)/as.numeric(subset$Total)*100

subset$percentOfMaleInSciEng <- as.numeric(subset$SciEngMale)/as.numeric(subset$SciEngTotal)*100     # this pair of percentages sums to 100%
subset$percentOfFemaleInSciEng <- as.numeric(subset$SciEngFemale)/as.numeric(subset$SciEngTotal)*100

subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100           # whereas this pair has no direct relationship
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

subset$percentSciEngRelatedMale <- as.numeric(subset$SciEngRelatedMale)/as.numeric(subset$TotalMale)*100          
subset$percentSciEngRelatedFemale <- as.numeric(subset$SciEngRelatedFemale)/as.numeric(subset$TotalFemale)*100

subset$percentBusinessMale <- as.numeric(subset$BusinessMale)/as.numeric(subset$TotalMale)*100
subset$percentBusinessFemale <- as.numeric(subset$BusinessFemale)/as.numeric(subset$TotalFemale)*100

subset$percentEducationMale <- as.numeric(subset$EducationMale)/as.numeric(subset$TotalMale)*100
subset$percentEducationFemale <- as.numeric(subset$EducationFemale)/as.numeric(subset$TotalFemale)*100

subset$percentHumanitiesMale <- as.numeric(subset$HumanitiesMale)/as.numeric(subset$TotalMale)*100
subset$percentHumanitiesFemale <- as.numeric(subset$HumanitiesFemale)/as.numeric(subset$TotalFemale)*100

subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale

head(subset)

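Dividing the number of male and female degree holders in each major by the total number of male and female degree holders gives the percentage of degree holders for each state, major, and gender. Having both the raw counts and the relative figures is important for a more complete analysis. To clarify the comments in the second and third pairs of calculations above, "percentOfMaleInSciEng" and "percentOfFemaleInSciEng" represent the percentage of men and of women holding science and engineering degrees relative to the total of science and engineering degree holders of both genders. That is why these two values sum to 100%. Meanwhile, "percentSciEngMale" and "percentSciEngFemale" represent the share of men and women holding science and engineering degrees relative to their own other degrees, not to the other gender, which is why these two percentages most likely do not sum to 100%. Both calculations are important for understanding the relationship among men, women, and degree choice.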
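To make the difference between the two pairs of percentages concrete, here is a small worked sketch using the Alabama row printed by head(subset). The counts are copied by hand from that output, and the snake_case variable names are introduced here for illustration only.

```r
# Alabama counts copied from the head(subset) output.
sci_eng_male   <- 146493
sci_eng_female <- 86455
sci_eng_total  <- 232948
total_male     <- 366201
total_female   <- 426675

# Pair 1: each gender's share of all science and engineering degree holders.
# Both use the same denominator, so the two values sum to 100.
pct_of_male_in_sci_eng   <- sci_eng_male / sci_eng_total * 100
pct_of_female_in_sci_eng <- sci_eng_female / sci_eng_total * 100
pair1_sum <- pct_of_male_in_sci_eng + pct_of_female_in_sci_eng

# Pair 2: science and engineering degrees as a share of each gender's own degrees.
# The denominators differ, so there is no reason for these to sum to 100.
pct_sci_eng_male   <- sci_eng_male / total_male * 100
pct_sci_eng_female <- sci_eng_female / total_female * 100
pair2_sum <- pct_sci_eng_male + pct_sci_eng_female

# Female-to-male ratio of the second pair, as stored in SciEngRatio.
sci_eng_ratio <- pct_sci_eng_female / pct_sci_eng_male

print(c(pair1_sum = pair1_sum, pair2_sum = pair2_sum, ratio = sci_eng_ratio))
```

The last value should agree with the SciEngRatio entry shown for Alabama in the printed data frame.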

median(as.numeric(subset$TotalMale))
median(as.numeric(subset$TotalFemale))

median(as.numeric(subset$SciEngMale))
median(as.numeric(subset$SciEngFemale))

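By computing the median number of total male and female degree holders per state, we can see that women hold more degrees in the typical state. The ratio of female to male degree holders is about 1.2 to 1, whereas the ratio of female to male science and engineering degree holders is about 1.0 to 1.59, which shows that even though women hold more degrees overall, men on average hold more science and engineering degrees than women.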

sum(as.numeric(subset$TotalMale))
sum(as.numeric(subset$TotalFemale))

sum(as.numeric(subset$SciEngMale))
sum(as.numeric(subset$SciEngFemale))

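Computing the same ratios from the sums of counts across all states, the overall female-to-male ratio is about 1.1 to 1, while the corresponding science and engineering ratio is about 1.0 to 1.5. Therefore, for both the state-median ratios and the summed ratios, women generally hold more degrees, but the relative gap between male and female science and engineering degrees is larger and runs in the opposite direction.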
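The two kinds of ratio can be sketched side by side. The example below uses only the six states shown in the head(subset) output, so its numbers differ from the 52-row figures quoted in the text; the vector names are assumptions made for this illustration.

```r
# Six states (Alabama through Colorado), copied from the head(subset) output.
total_male     <- c(366201, 67843, 621477, 198339, 4123037, 705075)
total_female   <- c(426675, 71573, 635972, 235042, 4292653, 735701)
sci_eng_male   <- c(146493, 29875, 262739, 78847, 2040615, 331019)
sci_eng_female <- c(86455, 20712, 153871, 47663, 1386950, 222477)

# Ratio of medians: compares the typical state.
all_median_f_to_m <- median(total_female) / median(total_male)
sci_median_m_to_f <- median(sci_eng_male) / median(sci_eng_female)

# Ratio of sums: compares the pooled population, so large states dominate.
all_sum_f_to_m <- sum(total_female) / sum(total_male)
sci_sum_m_to_f <- sum(sci_eng_male) / sum(sci_eng_female)

print(round(c(all_median_f_to_m = all_median_f_to_m,
              all_sum_f_to_m    = all_sum_f_to_m,
              sci_median_m_to_f = sci_median_m_to_f,
              sci_sum_m_to_f    = sci_sum_m_to_f), 3))
```

In this small sample both all-degree ratios favor women while both science and engineering ratios favor men, which is the same reversal described above.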

median(subset$percentMaleTotal)
median(subset$percentFemaleTotal)

median(subset$percentOfMaleInSciEng )
median(subset$percentOfFemaleInSciEng)

median(subset$percentSciEngMale)
median(subset$percentSciEngFemale)
median(subset$percentSciEngMale)/median(subset$percentSciEngFemale)

median(subset$percentSciEngRelatedMale)/median(subset$percentSciEngRelatedFemale) 

median(subset$percentBusinessMale)/median(subset$percentBusinessFemale) 

median(subset$percentEducationMale)/median(subset$percentEducationFemale) 

median(subset$percentHumanitiesMale)/median(subset$percentHumanitiesFemale)

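From the percentage calculations, women held about 52.5% of all bachelor's degrees among people over 25 in 2015. Yet they only held about 39.4% of science and engineering degrees. Consistent with the ratios just examined, men are more inclined than women to enter science and engineering. On average, 42.3% of all degrees held by men are in science and engineering, whereas on average only 24.1% of women's degrees are science and engineering degrees. The ratio of these percentages is 1.75, which is the largest among all the college majors; business majors rank second at 1.46. The differences across total degree counts, state averages, and average state percentages all indicate that women not entering science or engineering is a societal trend. A 2009 paper analyzing a survey of 161 Northwestern University students determined, through an econometric model based on the survey results, that the most important reasons women decide not to enter a particular major relate to expectations about enjoying the coursework. The author suggests that the differences between men's and women's expectations of coursework in particular departments may be related to gender discrimination in society (Zafar 29). A 1984 article used data from the National Longitudinal Study of the High School Class of 1972; its authors argue that substantial differences in preferences for various types of work are already present in the senior year of high school and carry into subsequent preparation for the labor market during college (Daymont 414). Again, an economic regression model based on survey answers determined which factors carry the most weight among the leading contributors to the gender gap. If students' preferences and impressions about career choice affect college major choice, and thereby career path and earnings, then the 2015 American Community Survey data further support the view that the gender gap begins to form even before students finish their education.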

options(scipen=2000000) # print numbers in regular decimal notation instead of scientific notation
summary(lm(as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale)))

Call:
lm(formula = as.numeric(subset$SciEngFemale) ~ as.numeric(subset$SciEngMale))

Residuals:
   Min     1Q Median     3Q    Max 
-78668  -9063   -421   6422  97610 

Coefficients:
                                  Estimate   Std. Error t value
(Intercept)                   -3391.997658  4208.916304  -0.806
as.numeric(subset$SciEngMale)     0.676498     0.009741  69.447
                                         Pr(>|t|)    
(Intercept)                                 0.424    
as.numeric(subset$SciEngMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23820 on 50 degrees of freedom
Multiple R-squared:  0.9897,	Adjusted R-squared:  0.9895 
F-statistic:  4823 on 1 and 50 DF,  p-value: < 0.00000000000000022

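The code above outputs a linear regression summary in which the raw count of science and engineering degrees held by women is regressed on the raw count of science and engineering degrees held by men for each state, DC, and Puerto Rico. The same summary output is then produced for the other four major types. The key summary statistics provide further evidence on how women systematically avoid majoring in science or engineering.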

summary(lm(as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale)))

Call:
lm(formula = as.numeric(subset$SciEngRelatedFemale) ~ as.numeric(subset$SciEngRelatedMale))

Residuals:
   Min     1Q Median     3Q    Max 
-30447  -5203  -2189   6499  34268 

Coefficients:
                                       Estimate Std. Error t value
(Intercept)                          7139.04527 2119.41889   3.368
as.numeric(subset$SciEngRelatedMale)    2.32544    0.04122  56.413
                                                 Pr(>|t|)    
(Intercept)                                       0.00146 ** 
as.numeric(subset$SciEngRelatedMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11470 on 50 degrees of freedom
Multiple R-squared:  0.9845,	Adjusted R-squared:  0.9842 
F-statistic:  3182 on 1 and 50 DF,  p-value: < 0.00000000000000022

summary(lm(as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale)))

Call:
lm(formula = as.numeric(subset$BusinessFemale) ~ as.numeric(subset$BusinessMale))

Residuals:
   Min     1Q Median     3Q    Max 
-42580  -3227    412   2940  56153 

Coefficients:
                                   Estimate  Std. Error t value
(Intercept)                     -1341.27859  2532.33558   -0.53
as.numeric(subset$BusinessMale)     0.80450     0.01122   71.68
                                           Pr(>|t|)    
(Intercept)                                   0.599    
as.numeric(subset$BusinessMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13810 on 50 degrees of freedom
Multiple R-squared:  0.9904,	Adjusted R-squared:  0.9902 
F-statistic:  5138 on 1 and 50 DF,  p-value: < 0.00000000000000022

summary(lm(as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale)))

Call:
lm(formula = as.numeric(subset$EducationFemale) ~ as.numeric(subset$EducationMale))

Residuals:
   Min     1Q Median     3Q    Max 
-62856  -8567  -1395   7101  82946 

Coefficients:
                                    Estimate  Std. Error t value
(Intercept)                      -4432.65497  4627.01232  -0.958
as.numeric(subset$EducationMale)     3.35557     0.08909  37.666
                                            Pr(>|t|)    
(Intercept)                                    0.343    
as.numeric(subset$EducationMale) <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22270 on 50 degrees of freedom
Multiple R-squared:  0.966,	Adjusted R-squared:  0.9653 
F-statistic:  1419 on 1 and 50 DF,  p-value: < 0.00000000000000022

summary(lm(as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale)))

Call:
lm(formula = as.numeric(subset$HumanitiesFemale) ~ as.numeric(subset$HumanitiesMale))

Residuals:
   Min     1Q Median     3Q    Max 
-60060  -6516   2465   6738  59227 

Coefficients:
                                     Estimate  Std. Error t value
(Intercept)                       -8813.81804  2759.08699  -3.194
as.numeric(subset$HumanitiesMale)     1.39733     0.01402  99.692
                                              Pr(>|t|)    
(Intercept)                                    0.00243 ** 
as.numeric(subset$HumanitiesMale) < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15380 on 50 degrees of freedom
Multiple R-squared:  0.995,	Adjusted R-squared:  0.9949 
F-statistic:  9938 on 1 and 50 DF,  p-value: < 0.00000000000000022

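The dependence of the number of female degree holders on the number of male degree holders is positive in all five majors. Looking at the raw counts of bachelor's degree holders, this result is not surprising. The positive relationship between male and female degree counts means that a state with more degree holders of one gender also has relatively more of the other. The smallest slope belongs to the science and engineering relationship, at 0.68, while the other slopes are 0.80, 1.40, 2.33, and 3.36. Because the p-value for each slope is negligibly close to zero, the slope coefficients are all statistically significant.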
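For readers who want the slope and r-squared as numbers rather than reading them off the printed summary, the following is a minimal sketch. It fits the same kind of model to only the six states shown in the head(subset) output, so its estimates will not equal the 52-row values reported above; the data frame "sci_eng" and its column names are made up for this example.

```r
# Six states copied from the head(subset) output; not the full data set.
sci_eng <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  male   = c(146493, 29875, 262739, 78847, 2040615, 331019),
  female = c(86455, 20712, 153871, 47663, 1386950, 222477),
  stringsAsFactors = FALSE
)

# Female count regressed on male count, as in the summaries above.
fit <- lm(female ~ male, data = sci_eng)

# coef() returns the intercept and slope; summary() carries the r-squared.
slope     <- unname(coef(fit)["male"])
r_squared <- summary(fit)$r.squared

# A slope below 1 means each additional male degree holder is matched
# by fewer than one additional female degree holder.
cat("slope:", round(slope, 3), " r-squared:", round(r_squared, 3), "\n")
```

Wrapping the lm call in a loop over the five major names would collect all five slopes into one vector for comparison.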

ggplot(subset, aes(as.numeric(SciEngMale), as.numeric(SciEngFemale)))  +
      geom_point()+
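      #scale_x_continuous(name="Total Male Science Degree Holders", limits=c(0,150000)) +
      #scale_y_continuous(name="Total Female Science Degree Holders", limits=c(0,150000))+
      labs(x= "Total Male Science Degree Holders by State") +
      labs(y = "Total Female Science Degree Holders by State")+ ylim(0,2000000)+
      labs(title= "Relationship Between Male and Female Science Degree Holders") + 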
      stat_smooth(method = lm, se = FALSE, color = "black") +
      geom_vline(xintercept = 134349, linetype="dotted", colour="red")+
      geom_hline(yintercept =  84527, linetype="dotted", colour="red")+
      geom_vline(xintercept = 0)+
      geom_hline(yintercept = 0)+
      annotate("text", label = "r^2 == 0.9895", parse = TRUE,x= 1400000, y = 1500000) +
      annotate("text", label = "slope = 0.676498", x= 1475000, y = 1250000)



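#qplot(as.numeric(SciEngMale), as.numeric(SciEngFemale), data = subset, color = I("darkblue"),
#      xlab = "Total Male Science Degree Holders", ylab = "Total Female Science Degree Holders", 
#      main = "Relationship Between Male and Female Science Degree Holders") + geom_smooth(method = "lm", se = FALSE)
# qplot version of the ggplot above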

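The ggplot above visualizes the first of the five linear regressions run earlier. The x-axis represents the total number of male bachelor's degree holders in science or engineering aged 25 and over in each of the 50 states, DC, and Puerto Rico. The y-axis represents the same data for women. The median number of men with a science or engineering degree is 134,349, while the median for women is 84,527. The dotted crosshair lines mark this point. Also, the r-squared value indicates that almost 99% of the variation in the number of female science and engineering degree holders is explained by variation in the number of male science or engineering degree holders. This value is remarkably high for a regression with only one independent variable, but the regressions for the other college major categories show similarly high r-squared values. Barring the possibility of data distortion, this suggests that the number of male degree holders in a particular major in a state in 2015 is a very precise predictor of the number of female degree holders in that major.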

summary(lm(subset$percentSciEngFemale ~ subset$percentSciEngMale))
ggplot(subset, aes(percentSciEngMale, percentSciEngFemale))  +
      geom_point()+
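      labs(x= "Percent of Men with Science Degrees by State") + xlim(32.5,52.5)+ 
      labs(y = "Percent of Women with Science Degrees by State")+ ylim(15.25,35.25)+
      labs(title= "Relationship Between Percent of Men and Women with Science Degrees") + 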
      stat_smooth(method = lm, se = FALSE, color = "black")

Call:
lm(formula = subset$percentSciEngFemale ~ subset$percentSciEngMale)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9425 -1.5519  0.5676  1.1299  7.3309 

Coefficients:
                          Estimate Std. Error t value            Pr(>|t|)    
(Intercept)              -17.19788    3.77537  -4.555 0.00003383556251666 ***
subset$percentSciEngMale   0.99056    0.08823  11.227 0.00000000000000284 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.589 on 50 degrees of freedom
Multiple R-squared:  0.716,	Adjusted R-squared:  0.7103 
F-statistic: 126.1 on 1 and 50 DF,  p-value: 0.000000000000002836

Warning message:
"Removed 1 rows containing non-finite values (stat_smooth)."Warning message:
"Removed 1 rows containing missing values (geom_point)."

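Looking at the percentage of degree holders in each state who majored in science and engineering, as opposed to the raw counts, there is a roughly one-to-one slope, computed in the regression above at about 0.99. At first glance this relationship may seem to conflict with the interpretation of the raw counts, but because of the difference in the observed values, this slope actually further supports the view that markedly fewer women than men go into science and engineering. For every one percentage point increase in the share of men holding science and engineering bachelor's degrees, the share of women holding such degrees is predicted to rise by about one percentage point as well. However, the range of percentage values for women, as seen by the y-axis, sits roughly 18 percentage points below that for men, in line with the median state percentages of 42.3% for men and 24.1% for women. Importantly, this relationship implies that in states where high percentages of both men and women hold science degrees, the two percentages are more similar in relative terms than in states with low percentages, since a fixed gap between larger numerator and denominator values gives a fraction closer to 1. This will be made more visual later by taking the ratio of these two variables and mapping the state values onto a choropleth.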
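The point about the ratio moving toward 1 can be checked directly from the fitted line. The sketch below plugs a few hypothetical male percentages into the reported intercept and slope; the chosen x values and the variable names are assumptions for illustration, not observations from the data.

```r
# Coefficients reported in the regression summary above.
intercept <- -17.19788
slope     <- 0.99056

# Hypothetical male percentages spanning the plotted x-axis range.
male_pct <- c(35, 40, 45, 50)

# Predicted female percentage and the implied female-to-male ratio.
female_pct <- intercept + slope * male_pct
ratio      <- female_pct / male_pct

# ratio = slope + intercept / male_pct; with a negative intercept it
# rises toward the slope (about 0.99) as male_pct grows.
print(data.frame(male_pct, female_pct = round(female_pct, 2), ratio = round(ratio, 3)))
```

Because the gap between the predicted percentages stays near 17 to 18 points while both percentages grow, the ratio increases steadily across the range but stays below 1.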

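Initializing the state latitude and longitude data and creating choropleth maps will help give a better sense of the regional implications of these findings.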

states <- map_data("state")
head(states)
dim(states)
head(subset)

names(subset) <- tolower(names(subset))
subset$region <- tolower(subset$state)
head(subset)

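The code above modifies the original "subset" data by renaming its columns with lowercase titles and adding a final column named "region", which holds the lowercase state name so that each entry shares a key with the "states" data.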

choro_df <- merge(states, subset, by = "region") # merge(df1, df2, by = "column name")
head(choro_df)

After creating a matching column in both data frames, they can be combined with the "merge" command and then sorted by the "order" column.
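The join-then-reorder step can be tried on a toy example without loading any map data. In this sketch the "polygons" data frame is a made-up stand-in for the output of map_data("state") (its coordinates are invented), and "degrees" is a three-row stand-in for "subset"; only base R is assumed.

```r
# Toy polygon table standing in for map_data("state"); coordinates are made up.
polygons <- data.frame(
  region = c("arizona", "arizona", "alabama", "alabama"),
  order  = c(3, 4, 1, 2),
  long   = c(-114.6, -114.7, -87.5, -87.6),
  lat    = c(35.0, 34.9, 30.4, 30.3),
  stringsAsFactors = FALSE
)

# Toy degree table; the two ratios are copied from the head(subset) output.
degrees <- data.frame(
  State       = c("Arizona", "Alabama", "Puerto Rico"),
  sciengratio = c(0.5722941, 0.5065188, NA),
  stringsAsFactors = FALSE
)

# Build the shared lowercase key, then join on it.
degrees$region <- tolower(degrees$State)
merged <- merge(polygons, degrees, by = "region")

# merge() returns rows sorted by the key, so restore the drawing order.
merged <- merged[order(merged$order), ]
print(merged)
```

Rows without a match in both tables are dropped by default, which is why the "Puerto Rico" row in this toy example does not appear in the merged result.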

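The following two maps, titled "Female Degree Holders with Science or Engineering Majors" and "Male Degree Holders with Science or Engineering Majors", display the results of the raw-count regression analysis carried out earlier. States with more men holding science and engineering degrees also have more women holding them. The legend beside each map indicates the relative gap in degree counts between the genders. In a regional context, because there is a strong direct relationship between male and female science and engineering degree counts, the maps look very similar.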

choro <- choro_df[order(choro_df$order),]  #order by "order" column
head(choro)

choro$breaks <- cut(as.numeric(choro$sciengfemale),breaks = seq(0,1400000, by = 100000), include.lowest = TRUE, 
                    labels = c("0-100,000","100,001-200,000","200,001-300,000","300,001-400,000","400,001-500,000",
                              "500,001-600,000","600,001-700,000","700,001-800,000","800,001-900,000","900,001-1,000,000",
                              "1,000,001-1,100,000","1,100,001-1,200,000","1,200,001-1,300,000","1,300,001-1,400,000"))
                              
#choro$breaks <- cut(as.numeric(choro$sciengfemale), breaks = seq(0,1500000, by = 250000), include.lowest = TRUE, 
#                    labels = c("0-250,000","250,000-500,000","500,000-750,000",
#                              "750,000-1,000,000","1,000,000-1,250,000","1,250,000-1,500,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Female Degree Holders with Science or Engineering Majors") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Reds")

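The map above shows raw degree counts for women, for which break separation units of 100,000 were used, while break separation units of 200,000 were used for the men. Even with different intervals, the trend across all states of men holding roughly 1.5 times as many science degrees as women can be seen in the similarity of the shading.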
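To see why the two maps end up with similar shading despite different bin widths, the sketch below bins three states' counts with the same break sequences used for the maps. The counts are copied from the head(subset) output; the named vectors and the "position" calculation are introduced here only as an illustration.

```r
# Science and engineering degree counts copied from the head(subset) output.
female <- c(Alabama = 86455, Arizona = 153871, California = 1386950)
male   <- c(Alabama = 146493, Arizona = 262739, California = 2040615)

# Same break sequences as the two maps: 14 bins for women, 11 bins for men.
female_bin <- as.integer(cut(female, breaks = seq(0, 1400000, by = 100000),
                             include.lowest = TRUE))
male_bin   <- as.integer(cut(male, breaks = seq(0, 2200000, by = 200000),
                             include.lowest = TRUE))

# Position of each state along its own color scale (1 = darkest bin).
position <- data.frame(
  state      = names(female),
  female_bin = female_bin,
  male_bin   = male_bin,
  female_pos = round(female_bin / 14, 2),
  male_pos   = round(male_bin / 11, 2)
)
print(position)
```

In this small sample each state lands at the same or a nearby step on both scales, which is what produces the matching colors.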

choro$breaks <- cut(as.numeric(choro$sciengmale),breaks = seq(0,2200000, by = 200000), include.lowest = TRUE, 
                    labels = c("0-200,000","200,001-400,000","400,001-600,000","600,001-800,000","800,001-1,000,000",
                              "1,000,001-1,200,000","1,200,001-1,400,000","1,400,001-1,600,000","1,600,001-1,800,000","1,800,001-2,000,000",
                              "2,000,001-2,200,000"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Male Degree Holders with Science or Engineering Majors") + 
    scale_fill_brewer(name = "Number of Degrees", palette = "Blues")

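Although the raw counts suggest no regional difference between men and women in the number of science degrees, looking at the percentage of science or engineering degrees in each state leads to a different conclusion. Comparing the two, some regions of the country rank high for men in science but not for women, and vice versa. Some states do have high percentages of both, like California and New York, but in other states, such as Wyoming and Florida, the percentage is relatively high for men and relatively low for women. Beyond the regional differences, these maps further demonstrate that more men than women go into science. The share of degrees in science and engineering is much lower for women, as shown earlier by the median percentages of men and women with science and engineering degrees.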

choro$breaks <- cut(choro$percentsciengfemale,breaks = seq(15,45, by = 5), include.lowest = TRUE, 
                    labels = c("15%-20%","20%-25%","25%-30%",
                              "30%-35%","35%-40%","40%-45%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percent of Female Degree Holders with Science or Engineering Majors") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Reds")

choro$breaks <- cut(choro$percentsciengmale,breaks = seq(30,55, by = 5), include.lowest = TRUE, 
                    labels = c("30%-35%","35%-40%","40%-45%",
                              "45%-50%","50%-55%"))
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Percent of Male Degree Holders with Science or Engineering Majors") + 
    scale_fill_brewer(name = "Degree Rates", palette = "Blues")

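Comparing the share of women's degrees that are in science and engineering with the corresponding share for men, we can see from just one, rather than two, maps how region affects the relative rates at which men and women earn science and engineering degrees. Darker states indicate a higher female-to-male ratio, but the ratio never reaches 1. The general trend is that the East and West Coasts have higher ratios than the rest of the country. Unlike the first two maps, which show very similar regional patterns, this ratio map clearly shows the varying degrees of gender bias in college majors across different regions of the country.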

choro$breaks <- cut(choro$percentsciengfemale/choro$percentsciengmale,breaks = seq(0.35,0.85,by = 0.05), include.lowest = TRUE, 
                    labels = c("0.35-0.40","0.40-0.45","0.45-0.50","0.50-0.55",
                              "0.55-0.60","0.60-0.65","0.65-0.70","0.70-0.75","0.75-0.80","0.80-0.85"))
#choro$breaks = cut(choro$percentsciengfemale/choro$percentsciengmale, 6)
qplot(long, lat, data = choro, group = group, fill = breaks, geom = "polygon", 
      main = "Ratio of Female to Male Science or Engineering Major Percentages") + 
    scale_fill_brewer(name = "Ratio of Percents",
                       palette = "Purples")

desired_columns <- c(3,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72)  # re-create the original "subset" data frame, unaltered by the choropleth merge
desired_rows <- seq(2,53) #all states
subset <- bachelors[desired_rows, desired_columns] 
colnames(subset) <- c("State","Total","TotalMale","TotalFemale",
                      "SciEngTotal", "SciEngMale", "SciEngFemale",
                      "SciEngRelatedTotal", "SciEngRelatedMale", "SciEngRelatedFemale",
                      "BusinessTotal","BusinessMale","BusinessFemale",
                      "EducationTotal","EducationMale","EducationFemale",
                     "HumanitiesTotal","HumanitiesMale","HumanitiesFemale")

subset$percentSciEngMale <- as.numeric(subset$SciEngMale)/as.numeric(subset$TotalMale)*100           # this pair does not sum to 100%
subset$percentSciEngFemale <- as.numeric(subset$SciEngFemale)/as.numeric(subset$TotalFemale)*100

subset$SciEngRatio <- subset$percentSciEngFemale/subset$percentSciEngMale   

head(subset)

sort(as.numeric(subset$SciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngMale == "2040615"), "State"]
subset[which(subset$SciEngMale == "1074765"), "State"]
subset[which(subset$SciEngMale == "912146"), "State"]

sort(as.numeric(subset$SciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$SciEngFemale == "1386950"), "State"]
subset[which(subset$SciEngFemale == "711283"), "State"]
subset[which(subset$SciEngFemale == "645017"), "State"]

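By sorting the states with the highest numbers of science degrees, the "which" function can then identify which states those counts belong to. California has the most male and female degree holders with science and engineering majors. New York and Texas come next in both measures.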

sort(as.numeric(subset$percentSciEngMale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngMale == "51.9724320009824"), "State"]
subset[which(subset$percentSciEngMale == "50.63696225976"), "State"]
subset[which(subset$percentSciEngMale == "49.9569814248343"), "State"]

sort(as.numeric(subset$percentSciEngFemale), decreasing = TRUE)[1:10]
subset[which(subset$percentSciEngFemale == "41.6149338278437"), "State"]
subset[which(subset$percentSciEngFemale == "33.1419517414361"), "State"]
subset[which(subset$percentSciEngFemale == "32.9437484466618"), "State"]

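Washington DC has the highest percentage of both men and women with science and engineering degrees. Washington State and Maryland have high concentrations among men, while Massachusetts and Virginia have high percentages among women.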

subset[which(subset$percentSciEngMale > 46), "State"]
subset[which(as.numeric(subset$SciEngMale) > 410000), "State"]

order(subset$percentSciEngMale)[42:52]
order(as.numeric(subset$SciEngMale))[42:52]

subset[which(subset$percentSciEngFemale > 29), "State"]
subset[which(as.numeric(subset$SciEngFemale) > 290000), "State"]

order(subset$percentSciEngFemale)[42:52]
order(as.numeric(subset$SciEngFemale))[42:52]
order(subset$SciEngRatio)[42:52]

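Looking at state names and their row entry numbers through the "which" and "order" functions, the states with the highest concentrations of men and women in science and engineering majors do not necessarily match the states with the highest degree counts. Likewise, states with higher female-to-male ratios do not necessarily match the states with the highest percentages of women. These figures show that although the ratio of male to female science and engineering degrees appears relatively stable nationally at about 3:2, at the state level the ratio of men to women in science is far from uniform. The degree of gender disparity in the sciences therefore depends on the region of the country. Note that the "which" command lists states in alphabetical order, while the "order" command lists row positions in ascending order of the chosen measure, which is why positions 42 through 52 are used to index the largest values.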
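A compact way to see that the three rankings disagree is to map the output of "order" straight back to state names. The sketch below does this for the six states in the head(subset) output only; the data frame "six" and its column names are assumptions made for the example, with values copied from that printed table.

```r
# Six states copied from the head(subset) output.
six <- data.frame(
  state              = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  sci_eng_female     = c(86455, 20712, 153871, 47663, 1386950, 222477),
  pct_sci_eng_female = c(20.26249, 28.93829, 24.19462, 20.27850, 32.30986, 30.24014),
  sci_eng_ratio      = c(0.5065188, 0.6571582, 0.5722941, 0.5101041, 0.6528166, 0.6441191),
  stringsAsFactors = FALSE
)

# order() returns row positions in ascending order of the measure;
# indexing the state column with them gives names from lowest to highest.
by_count <- six$state[order(six$sci_eng_female)]
by_pct   <- six$state[order(six$pct_sci_eng_female)]
by_ratio <- six$state[order(six$sci_eng_ratio)]

print(rbind(by_count, by_pct, by_ratio))
```

Among these six, Alaska has the fewest female science and engineering degree holders but the highest female-to-male ratio, a small-scale version of the mismatch described above.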

summary(lm(log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal))))

Call:
lm(formula = log(as.numeric(subset$SciEngRatio)) ~ log(as.numeric(subset$SciEngTotal)))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35757 -0.07210 -0.01002  0.07399  0.34999 

Coefficients:
                                    Estimate Std. Error t value    Pr(>|t|)    
(Intercept)                         -1.02118    0.18148  -5.627 0.000000827 ***
log(as.numeric(subset$SciEngTotal))  0.03825    0.01454   2.630      0.0113 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1118 on 50 degrees of freedom
Multiple R-squared:  0.1215,	Adjusted R-squared:  0.104 
F-statistic: 6.918 on 1 and 50 DF,  p-value: 0.01131

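The regression above shows that the ratio map has no strong correlation with the total number of science and engineering degrees in each state. This was a point of concern because the map of science and engineering degree counts and the ratio map look similar. If the ratio map merely showed that states with more science and engineering degrees have a higher ratio of women's to men's share in science, it might simply indicate that more degree holders means more equitable ratios. Taking the logarithm of both variables, the dependent variable being the gender ratio among science and engineering majors and the independent variable being each state's total number of science and engineering degrees, gives the effect of a percentage change in total degrees on the percentage change in the ratio. Otherwise, the difference in units would produce meaningless statistics. Having a slope of just 0.03825 indicates that there is almost no relationship between a state's number of degrees and the degree of gender bias in the ratio of men and women majoring in science and engineering. A 1% increase in a state's number of degrees is associated with only about a 0.038% change in the ratio.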
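The size of this effect can be worked out from the slope alone. In a log-log model, multiplying the independent variable by a factor k multiplies the predicted dependent variable by k raised to the slope. The sketch below applies that to the reported slope; the helper name "pct_change_in_ratio" is introduced here for illustration.

```r
# Slope reported in the log-log regression summary above.
slope <- 0.03825

# Percent change in the predicted ratio when total degrees are multiplied by k.
pct_change_in_ratio <- function(k, b = slope) {
  (k^b - 1) * 100
}

one_percent_more <- pct_change_in_ratio(1.01)  # 1% more degrees
doubled          <- pct_change_in_ratio(2)     # twice as many degrees

cat("1% more degrees ->", round(one_percent_more, 3), "% change in ratio\n")
cat("doubling degrees ->", round(doubled, 2), "% change in ratio\n")
```

Even doubling a state's science and engineering degree count moves the predicted ratio by less than 3%, which is small next to the spread of ratios across the map bins (0.35 to 0.85).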

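There is not much literature that analyzes region, college major choice, and gender in the same context. However, given that preferences are an important factor in women not choosing majors such as engineering, a next step building on the results of this presentation would be to bring regional effects into the narrative, with the claim that certain regions of the country create environments in which women form more positive impressions of science. In his Federal Reserve Bank of New York staff report, Basit Zafar finds through the previously mentioned econometric model that about 60% of the gender gap in engineering is due to differences in preferences, while about 30% is due to differences in how much women and men believe they would enjoy studying engineering (Zafar 4). There are other explanations for why students choose one major over another, including ones that emphasize the link between major choice and political ideology. A 2006 paper finds that more liberal students are more likely to choose non-science majors (Porter, 2006). This explanation seems to run counter to the finding here that the typically more liberal coastal regions of the United States have high shares of science degrees among total degrees for both men and women. However, the survey used in Porter and Umbach's paper covered only one highly selective liberal arts college, and the authors acknowledge that the results cannot be extrapolated to a larger sample of students from different types of schools. Similarly, the 2015 data studied here are a single snapshot of the relationship among gender, major, and region. Further analysis over time should be considered to determine the changing landscape of gender imbalance in science and engineering major choice.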

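The complexity of gender gap analysis extends beyond the limits of the data. The chosen scope of examination inherently changes the range of possible explanations. In a study that looked not only at gender but also at socioeconomic status (SES), Professor Ma finds that women from lower-SES backgrounds are as likely as men to choose lucrative college majors, and that the role of a lucrative major in potentially raising the socioeconomic status of students and their families outweighs the traditional gender role socialization that steers men and women toward different career paths (Ma 228). In a paper on citizenship, the author finds a greater propensity to enroll in SEM (science, engineering, and mathematics) fields among the foreign-born population and a lower propensity to enroll in the social sciences, relative to citizens (Nores 138). To fully disentangle all possible influences on gender differences in college major choice, all possible variables would have to be included in the analysis.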

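Although the maps and ratios computed here indicate that different regions of the country exhibit different degrees of gender bias in college major choice, establishing the geographic causes is beyond the scope of this report. However, if government policies, educational backgrounds, or cultural differences are tied to region, then the analysis conducted here may be a starting point for determining why women across the country, to varying degrees, systematically choose not to enter science or engineering during their undergraduate careers. Moreover, college students do not necessarily come from the state in which they study. State-level bias may therefore reflect differences in the quality of a state's academic institutions rather than any difference in gender equality. Better schools may have more resources for science and engineering research. Based on the U.S. Census Bureau's American Community Survey data, it is clear that there is some reason why women are not entering science and engineering at the same rate as men, and that there is at least an indirect relationship between region and the degree of gender disparity.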

Bibliography

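Daymont, Thomas N., and Paul J. Andrisani. "Job Preferences, College Major, and the Gender Gap in Earnings." The Journal of Human Resources 19, no. 3 (1984): 408-28. doi:10.2307/145880.

Ma, Yingyi. "Family Socioeconomic Status, Parental Involvement, and College Major Choices: Gender, Race/Ethnic, and Nativity Patterns." Sociological Perspectives 52, no. 2 (2009): 211-34. doi:10.1525/sop.2009.52.2.211.

Nores, Milagros. "Differences in College Major Choice by Citizenship Status." The Annals of the American Academy of Political and Social Science 627 (2010): 125-41. http://www.jstor.org/stable/40607409.

Porter, Stephen R., and Paul D. Umbach. "College Major Choice: An Issue of Person-Environment Fit." Research in Higher Education 47, no. 4 (2006): 429-49. doi:10.1007/s11162-005-9002-3.

United States Census Bureau. (2015). American Community Survey [bachelors.csv]. Retrieved from http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

Zafar, Basit. "College Major Choice and the Gender Gap." SSRN Electronic Journal (2013): 1-50. doi:10.2139/ssrn.1348219.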

GEO.id	GEO.id2	GEO.display.label	HC01_EST_VC01	HC01_MOE_VC01	HC02_EST_VC01	HC02_MOE_VC01	HC03_EST_VC01	HC03_MOE_VC01	HC04_EST_VC01	...	HC02_EST_VC27	HC02_MOE_VC27	HC03_EST_VC27	HC03_MOE_VC27	HC04_EST_VC27	HC04_MOE_VC27	HC05_EST_VC27	HC05_MOE_VC27	HC06_EST_VC27	HC06_MOE_VC27
Id	Id2	Geography	Total; Estimate; Total population 25 years and over with a Bachelor's degree or higher	Total; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher	Percent; Estimate; Total population 25 years and over with a Bachelor's degree or higher	Percent; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher	Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher	Males; Margin of Error; Total population 25 years and over with a Bachelor's degree or higher	Percent Males; Estimate; Total population 25 years and over with a Bachelor's degree or higher	...	Percent; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Males; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Males; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Females; Estimate; DETAILED AGE - 65 years and over - Arts, Humanities and Others	Percent Females; Margin of Error; DETAILED AGE - 65 years and over - Arts, Humanities and Others
0400000US01	1	Alabama	792876	14677	(X)	(X)	366201	9531	(X)	...	17.8	1.6	14284	1696	16.4	1.8	14435	1926	19.4	2.4
0400000US02	2	Alaska	139416	4807	(X)	(X)	67843	3222	(X)	...	18.9	3.6	1951	565	17.1	4.4	2090	580	20.9	5.2
0400000US04	4	Arizona	1257449	16239	(X)	(X)	621477	10626	(X)	...	18.4	1	26996	2115	15.6	1.1	30518	2468	21.9	1.6
0400000US05	5	Arkansas	433381	7690	(X)	(X)	198339	5631	(X)	...	15.8	1.7	6777	1168	14	2.4	7330	1299	17.9	2.6
0400000US06	6	California	8415690	37555	(X)	(X)	4123037	23131	(X)	...	23.6	0.5	157982	5589	18.9	0.6	211503	6420	29.1	0.9

	State	Total	SciEng	SciEngRelated	Business	Education	HumArts
2	Alabama	792876	232948	79424	182750	135842	161912
3	Alaska	139416	50587	12854	22495	20385	33095
4	Arizona	1257449	416610	119456	271892	183272	266219
5	Arkansas	433381	126510	42437	95297	83480	85657
6	California	8415690	3427565	674278	1586921	563581	2163345
7	Colorado	1440776	553496	121176	286139	147138	332827
8	Connecticut	948044	342674	80873	187022	100734	236741
9	Delaware	201929	70332	21046	44771	28364	37416
10	District of Columbia	268345	125167	11869	33555	11676	86078
11	Florida	4092338	1283693	414601	995869	589792	808383
12	Georgia	2000113	641664	178239	485769	273021	421420
13	Hawaii	309194	110773	27619	61107	41200	68495
14	Idaho	276912	92786	30241	50346	45560	57979
15	Illinois	2853540	932408	269244	619650	384915	647323
16	Indiana	1088120	305153	134069	226250	185666	236982
17	Iowa	556591	169312	54940	116686	103090	112563
18	Kansas	599063	175589	64756	129649	105871	123198
19	Kentucky	696174	203689	78840	137156	119107	157382
20	Louisiana	718058	200267	88487	144280	121677	163347
21	Maine	289553	102732	28277	39879	45189	73476
22	Maryland	1591614	645663	138044	295580	157419	354908
23	Massachusetts	1951689	780836	158139	364241	175035	473438
24	Michigan	1870473	609561	194451	397446	277764	391251
25	Minnesota	1284007	429808	123550	252884	185440	292325
26	Mississippi	406599	100307	55009	84900	87213	79170
27	Missouri	1140860	342021	113391	246469	184485	254494
28	Montana	216174	71556	23757	35787	39252	45822
29	Nebraska	372288	103636	40424	84020	71497	72711
30	Nevada	463681	147490	42672	105846	61003	106670
31	New Hampshire	334313	123948	31664	63342	42482	72877
32	New Jersey	2318073	854811	195024	527062	259221	481955
33	New Mexico	364462	127327	32978	57214	58421	88522
34	New York	4778463	1623429	415315	895311	548835	1295573
35	North Carolina	1991057	687074	182823	402340	269287	449533
36	North Dakota	143403	40867	20868	27786	27259	26623
37	Ohio	2115116	650627	233599	448383	342930	439577
38	Oklahoma	630004	182916	61146	143828	122057	120057
39	Oregon	901667	347808	79655	131021	100430	242753
40	Pennsylvania	2641023	874686	276665	530252	396210	563210
41	Rhode Island	238818	80596	22368	45589	29037	61228
42	South Carolina	890241	283678	85236	197700	131685	191942
43	South Dakota	154885	47184	15976	29897	33840	27988
44	Tennessee	1151080	342842	117853	256625	171712	262048
45	Texas	4955374	1719782	445895	1164460	632224	993013
46	Utah	554712	180989	55741	104661	77687	135634
47	Vermont	162072	60132	12850	18942	22690	47458
48	Virginia	2102044	863355	159477	397325	201524	480363
49	Washington	1670893	685505	146243	274943	166411	397791
50	West Virginia	254414	74797	32048	44434	51203	51932
51	Wisconsin	1112458	340019	125112	225032	182504	239791
52	Wyoming	102034	35634	10228	14729	21140	20303
53	Puerto Rico	590228	146944	63591	190620	112350	76723

	State	Total	TotalMale	TotalFemale	SciEngTotal	SciEngMale	SciEngFemale	SciEngRelatedTotal	SciEngRelatedMale	SciEngRelatedFemale	...	percentSciEngFemale	percentSciEngRelatedMale	percentSciEngRelatedFemale	percentBusinessMale	percentBusinessFemale	percentEducationMale	percentEducationFemale	percentHumanitiesMale	percentHumanitiesFemale	SciEngRatio
2	Alabama	792876	366201	426675	232948	146493	86455	79424	20420	59004	...	20.26249	5.576173	13.82879	27.62663	19.12017	7.468576	25.42732	19.32518	21.36122	0.5065188
3	Alaska	139416	67843	71573	50587	29875	20712	12854	3428	9426	...	28.93829	5.052843	13.16977	18.57819	13.81946	9.125481	19.83150	23.20799	24.24098	0.6571582
4	Arizona	1257449	621477	635972	416610	262739	153871	119456	34583	84873	...	24.19462	5.564647	13.34540	25.39499	17.93601	7.067840	21.91087	19.69598	22.61310	0.5722941
5	Arkansas	433381	198339	235042	126510	78847	47663	42437	9224	33213	...	20.27850	4.650623	14.13067	27.88357	17.01526	8.265142	28.54256	19.44701	20.03302	0.5101041
6	California	8415690	4123037	4292653	3427565	2040615	1386950	674278	209270	465008	...	32.30986	5.075628	10.83265	20.86023	16.93233	3.194441	10.06075	21.37669	29.86442	0.6528166
7	Colorado	1440776	705075	735701	553496	331019	222477	121176	36832	84344	...	30.24014	5.223841	11.46444	23.05031	16.80261	5.171365	15.04361	19.60642	26.44920	0.6441191

long	lat	group	order	region	subregion
-87.46201	30.38968	1	1	alabama	NA
-87.48493	30.37249	1	2	alabama	NA
-87.52503	30.37249	1	3	alabama	NA
-87.53076	30.33239	1	4	alabama	NA
-87.57087	30.32665	1	5	alabama	NA
-87.58806	30.32665	1	6	alabama	NA
