Author: shukkkur

Description:
Analyzing the gender distribution of children's book writers and using sound to match names to gender.
Language: Jupyter Notebook
Repository: git://github.com/shukkkur/Gender-Prediction-using-Sound.git
Created: 2021-04-30T12:37:53Z
Project page: https://github.com/shukkkur/Gender-Prediction-using-Sound


Gender-Prediction-using-Sound


The same name can be spelled in many ways, for example Marc and Mark. Sound can therefore be a better way to match names than spelling. In this project, I use the Python package Fuzzy to infer the genders of authors who have appeared on the New York Times Best Seller list for Children’s Picture Books.


1. Sound it out!



Grey and Gray. Colour and Color. Words like these have been the cause of many heated arguments between Brits and Americans.
One way to tackle this challenge is to write a program that checks whether two strings sound the same, instead of checking whether they are spelled the same. We’ll do that here using fuzzy name matching.

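```python
import fuzzy

# NYSIIS reduces a word to a phonetic code
print(fuzzy.nysiis('gray'))
# GRY

# Words that sound alike share the same code
print(fuzzy.nysiis('colour') == fuzzy.nysiis('color'))
# True
```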
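To make the idea of a phonetic key concrete, here is a minimal sketch of classic American Soundex in pure Python. Soundex is a simpler algorithm than NYSIIS and is not what `fuzzy.nysiis` computes; the function name `soundex` is only illustrative. The point is just that differently spelled names can collapse to the same code.

```python
def soundex(name):
    """Return the 4-character American Soundex code for name ('' if it has no letters)."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    codes = {ch: digit for letters, digit in groups.items() for ch in letters}

    # Keep only ASCII letters, upper-cased
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]                  # the first letter is kept as-is
    prev = codes.get(letters[0], "")     # code of the previous letter ('' for vowels)
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate repeated codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code               # skip vowels and adjacent duplicate codes
        prev = code
    return (result + "000")[:4]          # pad or truncate to four characters


print(soundex("Marc"), soundex("Mark"))   # M620 M620
print(soundex("Grey") == soundex("Gray"))  # True
```

Both spellings of each word map to the same key, which is the property the rest of this project relies on.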

2. Authoring the authors

Let’s begin by reading in the data on the best selling authors from 2008 to 2017.

```python
import pandas as pd

# Read the yearly best-seller data (semicolon-delimited)
author_df = pd.read_csv('datasets/nytkids_yearly.csv', delimiter=';')

# Extract the first name from each author's full name
first_name = []
for name in author_df['Author']:
    first_name.append(name.split()[0])

author_df['first_name'] = first_name
author_df.head()
```

| | Year | Book Title | Author | Besteller this year | first_name |
|---:|---:|---:|---:|---:|---:|
| 0 | 2017 | DRAGONS LOVE TACOS | Adam Rubin | 49 | Adam |
| 1 | 2017 | THE WONDERFUL THINGS YOU WILL BE | Emily Winfield Martin | 48 | Emily |
| 2 | 2017 | THE DAY THE CRAYONS QUIT | Drew Daywalt | 44 | Drew |
| 3 | 2017 | ROSIE REVERE, ENGINEER | Andrea Beaty | 38 | Andrea |
| 4 | 2017 | ADA TWIST, SCIENTIST | Andrea Beaty | 28 | Andrea |
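One caveat with `name.split()[0]` is that it raises an `IndexError` on an empty or whitespace-only string. The sketch below shows a defensive variant in plain Python; the helper name `first_name_of` and the empty entry in the sample list are hypothetical, added only to illustrate the edge case.

```python
def first_name_of(full_name):
    """Return the first whitespace-separated token, or '' if there is none."""
    parts = full_name.split()
    return parts[0] if parts else ""


# Three author names from the table above, plus a hypothetical blank entry
authors = ["Adam Rubin", "Emily Winfield Martin", "Drew Daywalt", ""]
first_names = [first_name_of(a) for a in authors]
print(first_names)  # ['Adam', 'Emily', 'Drew', '']
```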

3. Time to bring on the phonics!


When we were young children, we were taught to read using phonics: sounding out the letters that make up words. So let’s relive history and do that again, this time using Python.



```python
# Compute the NYSIIS phonetic code of each first name
nysiis_name = []
for name in author_df['first_name']:
    nysiis_name.append(fuzzy.nysiis(name))

author_df['nysiis_name'] = nysiis_name
author_df.head()
```

| | Year | Book Title | Author | Besteller this year | first_name | nysiis_name |
|---:|---:|---:|---:|---:|---:|---:|
| 0 | 2017 | DRAGONS LOVE TACOS | Adam Rubin | 49 | Adam | ADAN |
| 1 | 2017 | THE WONDERFUL THINGS YOU WILL BE | Emily Winfield Martin | 48 | Emily | ENALY |
| 2 | 2017 | THE DAY THE CRAYONS QUIT | Drew Daywalt | 44 | Drew | DR |
| 3 | 2017 | ROSIE REVERE, ENGINEER | Andrea Beaty | 38 | Andrea | ANDR |
| 4 | 2017 | ADA TWIST, SCIENTIST | Andrea Beaty | 28 | Andrea | ANDR |


4. The inbetweeners


We’ll use babynames_nysiis.csv, a dataset that is derived from the Social Security Administration’s baby name data, to identify author genders. The dataset contains unique NYSIIS versions of baby names, and also includes the percentage of times the name appeared as a female name (perc_female) and the percentage of times it appeared as a male name (perc_male).



```python
babies_df = pd.read_csv('datasets/babynames_nysiis.csv', delimiter=';')

# Label each phonetic name with the gender it was used for more often
gender = []
for i in range(len(babies_df)):
    if babies_df.iloc[i]['perc_male'] > babies_df.iloc[i]['perc_female']:
        gender.append('M')
    elif babies_df.iloc[i]['perc_male'] < babies_df.iloc[i]['perc_female']:
        gender.append('F')
    else:
        gender.append('N')  # equal split: neutral

babies_df['gender'] = gender
babies_df.head()
```

| | babynysiis | perc_female | perc_male | gender |
|---:|---:|---:|---:|---:|
| 0 | NaN | 62.50 | 37.50 | F |
| 1 | RAX | 63.64 | 36.36 | F |
| 2 | ESAR | 44.44 | 55.56 | M |
| 3 | DJANG | 0.00 | 100.00 | M |
| 4 | PARCAL | 25.00 | 75.00 | M |
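The labelling rule can be isolated as a small function, which makes the tie case easy to see. This is a minimal sketch in plain Python: `classify_gender` is an illustrative name, the first three rows are copied from the table above, and the `"TIE"` row is hypothetical.

```python
def classify_gender(perc_female, perc_male):
    """Return 'M', 'F', or 'N' (neutral) from the two percentages."""
    if perc_male > perc_female:
        return "M"
    if perc_male < perc_female:
        return "F"
    return "N"


# (babynysiis, perc_female, perc_male)
rows = [
    ("RAX", 63.64, 36.36),
    ("ESAR", 44.44, 55.56),
    ("DJANG", 0.00, 100.00),
    ("TIE", 50.00, 50.00),  # hypothetical row to show the neutral case
]
labels = {code: classify_gender(f, m) for code, f, m in rows}
print(labels)  # {'RAX': 'F', 'ESAR': 'M', 'DJANG': 'M', 'TIE': 'N'}
```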

5. Playing matchmaker

Now that we have identified the likely genders of different names, let’s find author genders by searching for each author’s name in the babies_df DataFrame, and extracting the associated gender.

```python
def locate_in_list(a_list, element):
    """Return the index of element in a_list, or -1 if it is absent."""
    loc_of_name = a_list.index(element) if element in a_list else -1
    return loc_of_name

# Build the list of phonetic baby names once, outside the loop
baby_names = list(babies_df['babynysiis'])

author_gender = []
for name in author_df['nysiis_name']:
    nloc = locate_in_list(baby_names, name)
    if nloc == -1:
        author_gender.append('Unknown')
    else:
        author_gender.append(babies_df['gender'][nloc])

author_df['author_gender'] = author_gender
author_df['author_gender'].value_counts()
```

```
F          395
M          191
Unknown      9
N            8
Name: author_gender, dtype: int64
```
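Searching a list for every author scans the baby names again each time. An alternative, sketched here in plain Python, is to build a dictionary from phonetic code to gender once and use `dict.get` with a default. The names `gender_by_code` and `lookup_gender` are illustrative, the four entries come from the `babies_df` preview above, and `"ZZZ"` is a made-up code standing in for a name that is absent from the data.

```python
# Phonetic code -> gender, taken from the babies_df preview shown earlier.
# With the full DataFrame this could be built as
# dict(zip(babies_df['babynysiis'], babies_df['gender'])).
gender_by_code = {"RAX": "F", "ESAR": "M", "DJANG": "M", "PARCAL": "M"}


def lookup_gender(code, table):
    """Return the gender for a phonetic code, or 'Unknown' if it is not in the table."""
    return table.get(code, "Unknown")


codes = ["ESAR", "RAX", "ZZZ"]  # "ZZZ" is a hypothetical code with no match
author_gender = [lookup_gender(c, gender_by_code) for c in codes]
print(author_gender)  # ['M', 'F', 'Unknown']
```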

6. Tally up

From the results above, we see that there are more female authors than male authors on the New York Times best seller list. Our dataset spans 2008 to 2017. Let’s find out if there have been changes over time.

```python
years = sorted(author_df.Year.unique())

males_by_yr = []
females_by_yr = []
unknown_by_yr = []

# Count the authors with each gender label, year by year
for year in years:
    males_by_yr.append(len(author_df[(author_df['author_gender']=='M') & (author_df['Year']==year)]))
    females_by_yr.append(len(author_df[(author_df['author_gender']=='F') & (author_df['Year']==year)]))
    unknown_by_yr.append(len(author_df[(author_df['author_gender']=='Unknown') & (author_df['Year']==year)]))

males_by_yr
```

```
[8, 19, 27, 21, 21, 11, 21, 18, 25, 20]
```
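The same per-year tally can be expressed with `collections.Counter` from the standard library, counting `(year, gender)` pairs in a single pass. This is only a sketch of the counting logic: the `records` list below is hypothetical toy data, not the project's results.

```python
from collections import Counter

# Hypothetical (year, gender) pairs standing in for rows of author_df
records = [
    (2017, "M"), (2017, "F"), (2017, "F"),
    (2016, "M"), (2016, "Unknown"),
]

counts = Counter(records)                       # (year, gender) -> number of rows
years = sorted({year for year, _ in records})   # [2016, 2017]

# Counter returns 0 for pairs that never occur, so no special-casing is needed
males_by_yr = [counts[(y, "M")] for y in years]
females_by_yr = [counts[(y, "F")] for y in years]
unknown_by_yr = [counts[(y, "Unknown")] for y in years]

print(males_by_yr, females_by_yr, unknown_by_yr)  # [1, 1] [0, 2] [1, 0]
```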

7. Foreign-born authors?

Our gender data comes from Social Security applications of individuals born in the US. Hence, one possible explanation for the “unknown” genders associated with some author names is that these authors were foreign-born.

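```python
import numpy as np
import matplotlib.pyplot as plt

# Offset the female bars so they sit beside the male bars
years_shifted = list(np.array(years) + 0.25)
plt.bar(years, males_by_yr, width=0.25, color='lightblue')
plt.bar(years_shifted, females_by_yr, width=0.25, color='pink')
plt.xlabel('years')
plt.show()
```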