Project author: sudheera96

Project description: Project on word count using PySpark in the Databricks cloud environment.
Primary language: Jupyter Notebook
Project address: git://github.com/sudheera96/pyspark-textprocessing.git
Created: 2021-04-19T00:42:42Z
Project community: https://github.com/sudheera96/pyspark-textprocessing

License:


Sri Sudheera Chitipolu

I am Sri Sudheera Chitipolu, currently pursuing a Master's in Applied Computer Science at NWMSU, USA, and working as a Graduate Assistant for the Computer Science Department. I am consistently a top performer, result-oriented, with a positive attitude.

I am certified in:

  • AWS Cloud Practitioner
  • IBM Big Data Fundamentals
  • H2O.ai
  • Kubernetes, Containers


PySpark Text processing

PySpark Text processing is a project that counts the words in a website's content and visualizes the word counts in a bar chart and a word cloud.

Published notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6374047784683966/198390003695466/3813842128498967/latest.html (valid for 6 months)

Big Data skills used

  • Databricks Cloud Environment
  • Spark Processing Engine
  • PySpark API
  • Python Programming Language
  • Word cloud

Text Resource

The Project Gutenberg EBook of Little Women, by Louisa May Alcott

Commands

Data Gathering

We’ll use the urllib.request library to pull the data into the notebook. Once the book has been downloaded, we’ll save it to /tmp/ and name it littlewomen.txt.

  import urllib.request
  urllib.request.urlretrieve("https://www.gutenberg.org/cache/epub/514/pg514.txt", "/tmp/littlewomen.txt")
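
As an optional sanity check (a minimal sketch, not part of the original walkthrough), we can peek at the first few lines of the saved file:

  # print the first few lines of the downloaded file
  with open("/tmp/littlewomen.txt", encoding="utf-8") as f:
      for _ in range(5):
          print(f.readline().rstrip())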

Now it’s time to move the book into the Databricks file system (DBFS). The dbutils.fs.mv method takes two arguments: the first is the file’s current location, and the second is its destination. The first argument must begin with file:, followed by the local path. The second argument should begin with dbfs:, followed by the path where you want to save the file. Our file will be saved in the data folder.

  dbutils.fs.mv("file:/tmp/littlewomen.txt", "dbfs:/data/littlewomen.txt")

The final step is to load the file into Spark. Spark stores data in RDDs (Resilient Distributed Datasets), so we’ll convert our text into an RDD. In Databricks, the SparkContext is already available as sc. When reading the file, make sure to use the new DBFS location.

  LittleWomenRawRDD = sc.textFile("dbfs:/data/littlewomen.txt")
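
To confirm the RDD was created, we can inspect a few lines (an optional check, not part of the original notebook):

  # print the first few lines of the RDD
  for line in LittleWomenRawRDD.take(5):
      print(line)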

Cleaning the data

Capitalization, punctuation, and stopwords are all present in the current version of the text. Stopwords are words that improve the flow of a sentence without adding meaning to it; consider the word “the.” The first step in determining the word count is to flatMap the lines into words, removing capitalization and surrounding whitespace along the way. “Flatmapping” refers to breaking each line of text down into individual terms.

  LittleWomenMessyTokensRDD = LittleWomenRawRDD.flatMap(lambda line: line.lower().strip().split(" "))
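
To see what flatMap does, here is a tiny illustration on a throwaway RDD (the input sentence is made up for the example):

  # flatMap splits each line into words and flattens the results into one RDD of tokens
  demo = sc.parallelize(["Little Women, by Louisa May Alcott"])
  print(demo.flatMap(lambda line: line.lower().strip().split(" ")).collect())
  # ['little', 'women,', 'by', 'louisa', 'may', 'alcott']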

The next step is to eliminate all punctuation. We accomplish this with a regular expression that removes any character that isn’t a letter. We’ll need the re library to use regular expressions.

  import re
  LittleWomenCleanTokensRDD = LittleWomenMessyTokensRDD.map(lambda letter: re.sub(r'[^A-Za-z]', '', letter))
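
For example, the substitution strips punctuation and digits from a token while leaving letters intact (the sample tokens are made up):

  # anything that is not A-Z or a-z is replaced with the empty string
  print(re.sub(r'[^A-Za-z]', '', 'women,'))  # women
  print(re.sub(r'[^A-Za-z]', '', '1861.'))   # '' (an empty string, filtered out below)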

Now that the tokens are clean words, we must remove the stopwords. PySpark ships with a list of English stopwords, so we just need to import the StopWordsRemover class from pyspark.ml.feature and then filter those terms out.

  from pyspark.ml.feature import StopWordsRemover
  remover = StopWordsRemover()
  stopwords = remover.getStopWords()
  LittleWomenWordsRDD = LittleWomenCleanTokensRDD.filter(lambda PointLessW: PointLessW not in stopwords)
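
To get a feel for what is being removed, we can print a sample of the stopword list (the output shown is illustrative; the exact list depends on the Spark version):

  # peek at the first few entries of PySpark's default English stopword list
  print(stopwords[:10])
  # e.g. ['i', 'me', 'my', 'myself', 'we', ...]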

To finish cleaning, we filter out any empty elements left behind by the punctuation removal.

  LittleWomenEmptyRemoveRDD = LittleWomenWordsRDD.filter(lambda x: x != "")

Data processing

To process the data, we convert each word to the pair (word, 1), count how many times each word appears, and replace the second element with that count. The first step converts the words into key-value pairs.

  LittleWomenPairsRDD = LittleWomenEmptyRemoveRDD.map(lambda word: (word,1))

The second step reduces by key; in our case, the key is the word. reduceByKey merges all the pairs that share the same word, adding their counts together, so each word ends up in a single (word, total) pair.

  LittleWomenWordCountRDD = LittleWomenPairsRDD.reduceByKey(lambda acc, value: acc + value)
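
A toy example makes the two steps concrete (the words here are made up for the illustration):

  # map to (word, 1) pairs, then sum the counts per word
  toy = sc.parallelize(["jo", "meg", "jo", "amy", "jo"])
  pairs = toy.map(lambda word: (word, 1))
  print(pairs.reduceByKey(lambda acc, value: acc + value).collect())
  # [('jo', 3), ('meg', 1), ('amy', 1)] (order may vary)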

Finally, we swap each pair to (count, word) and use sortByKey(False) to sort the list in descending order of count. Once the words are ordered, we use take to grab the top ten items, then print the results to see the 10 most frequently used words in Little Women in order of frequency.

  LittleWomenResults = LittleWomenWordCountRDD.map(lambda x: (x[1], x[0])).sortByKey(False).take(10)
  print(LittleWomenResults)
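
An equivalent one-liner (an alternative, not from the original notebook) uses takeOrdered with a negated key, skipping the (count, word) swap; note that it returns (word, count) pairs instead:

  # take the 10 highest-count pairs directly from the word-count RDD
  top10 = LittleWomenWordCountRDD.takeOrdered(10, key=lambda pair: -pair[1])
  print(top10)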

Charting

Pandas, Matplotlib, and Seaborn will be used to visualize our results.

  import pandas as pd
  import matplotlib.pyplot as plt
  import seaborn as sns
  source = 'The Project Gutenberg EBook of Little Women, by Louisa May Alcott'
  title = 'Top Words in ' + source
  xlabel = 'Count'
  ylabel = 'Words'
  df = pd.DataFrame.from_records(LittleWomenResults, columns=[xlabel, ylabel])
  plt.figure(figsize=(10,3))
  # newer Seaborn releases require x and y as keyword arguments
  sns.barplot(x=xlabel, y=ylabel, data=df, palette="viridis").set_title(title)

[Figure: Word Count bar chart]

Word cloud

We can also create a word cloud from the word counts. This requires the nltk and wordcloud libraries.

  import nltk
  import wordcloud
  import matplotlib.pyplot as plt
  from nltk.corpus import stopwords
  from nltk.tokenize import word_tokenize
  from wordcloud import WordCloud

  class WordCloudGeneration:
      def preprocessing(self, data):
          # convert all words to lowercase
          data = [item.lower() for item in data]
          # load the English stopwords
          stop_words = set(stopwords.words('english'))
          # concatenate all the data with spaces
          paragraph = ' '.join(data)
          # tokenize the paragraph using the built-in tokenizer
          word_tokens = word_tokenize(paragraph)
          # filter out words present in the stopwords list
          preprocessed_data = ' '.join([word for word in word_tokens if word not in stop_words])
          print("\n Preprocessed Data: ", preprocessed_data)
          return preprocessed_data

      def create_word_cloud(self, final_data):
          # initiate a WordCloud object with width, height, maximum word count, font size and background color
          # call the generate method of the WordCloud class to generate the image
          wordcloud = WordCloud(width=1600, height=800, max_words=10, max_font_size=200, background_color="black").generate(final_data)
          # plot the image generated by the WordCloud class
          plt.figure(figsize=(12,10))
          plt.imshow(wordcloud)
          plt.axis("off")
          plt.show()

  wordcloud_generator = WordCloudGeneration()
  # you may uncomment the following line to use custom input
  # input_text = input("Enter the text here: ")
  import urllib.request
  url = "https://www.gutenberg.org/cache/epub/514/pg514.txt"
  request = urllib.request.Request(url)
  response = urllib.request.urlopen(request)
  input_text = response.read().decode('utf-8')
  input_text = input_text.split('.')
  clean_data = wordcloud_generator.preprocessing(input_text)
  wordcloud_generator.create_word_cloud(clean_data)

[Figure: word cloud of the most frequent words]

Errors

If the word cloud code above raises an error, we need to install the wordcloud and nltk packages and download NLTK's 'popular' data collection (which includes the stopwords corpus and the tokenizer models).

  # install the packages (shell or notebook cell)
  pip install wordcloud
  pip install nltk
  # then, in Python, download the NLTK data
  nltk.download('popular')
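
In a Databricks notebook, the same setup can be done with the %pip cell magic (a sketch; the magic is a notebook feature, not part of the original walkthrough):

  # install both packages from within the notebook
  %pip install wordcloud nltk

  # then, in a Python cell, fetch the NLTK data
  import nltk
  nltk.download('popular')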

Save charts

If we want to reuse the charts in other notebooks, the line of code below saves the current chart as a PNG file.

  plt.savefig('LittleWomen_Results.png')
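
On Databricks, the PNG lands on the driver's local filesystem; to keep it somewhere durable, it can be copied into DBFS (a sketch, assuming the default working directory /databricks/driver):

  # copy the saved chart from the driver's local disk into DBFS
  dbutils.fs.cp("file:/databricks/driver/LittleWomen_Results.png", "dbfs:/data/LittleWomen_Results.png")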

Insights

From the word count charts we can conclude that the important characters of the story are Jo, Meg, Amy, and Laurie. The word “good” is also repeated a lot, which suggests the story centers on goodness and happiness.
