Ask Ubuntu Logs Analysis with Hadoop MapReduce 2 (YARN)
The project extracts data from several smaller datasets and combines them for analysis.
The dataset contains Ask Ubuntu (https://askubuntu.com) Stack Exchange logs in XML format.
The total size of the dataset is 22 GB.
It is stored in the GCS bucket gs://stackoverflow-dataset-677.
The dataset consists of multiple XML files, each covering a different attribute of the data.
The following are the relevant features for each XML file.
mvn compile
gs://stackoverflow-dataset-677
Remember to compile the project before running jobs (if not already done in the previous section):
mvn compile
WeeklyPostCountJob.java
TweetMapper.java
TweetWritable.java
TweetCountReducer.java
Usage:
<Job.java> <input_file_location> <output_file_location> <positive_word_file_path> <negative_word_file_path>
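The two word-file arguments suggest a lexicon-based sentiment score. The sketch below shows one way such a score could be computed, assuming the score is the net count of positive minus negative words as a percentage of all words. `SentimentSketch`, `label`, and the sample word sets are hypothetical and not taken from the project source, so the actual job may use a different formula.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: illustrates a lexicon-based sentiment label.
// The formula is an assumption, not the project's confirmed method.
final class SentimentSketch {

    // Returns "<pct>% positive", "<pct>% negative", or "neutral".
    static String label(String text, Set<String> positive, Set<String> negative) {
        // Lowercase the text and split on anything that is not a letter or apostrophe.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z']+");
        int total = 0;
        int score = 0;
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token when the text starts with a delimiter
            }
            total++;
            if (positive.contains(word)) {
                score++;
            } else if (negative.contains(word)) {
                score--;
            }
        }
        if (total == 0 || score == 0) {
            return "neutral";
        }
        double pct = 100.0 * Math.abs(score) / total;
        return String.format(Locale.ROOT, "%.2f%% %s", pct, score > 0 ? "positive" : "negative");
    }

    public static void main(String[] args) {
        // Sample word sets stand in for the positive and negative word files.
        Set<String> pos = new HashSet<>(Arrays.asList("good", "great"));
        Set<String> neg = new HashSet<>(Arrays.asList("broken", "bad"));
        System.out.println(label("this is a good day", pos, neg));
    }
}
```

In a real job the word sets would be loaded once per task from the two word files rather than built inline.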
Example Output :
Ordered by total number of posts per week in descending order.
```
Week 8 Total posts 2
2014-02-18 13:33 Title: How can I set the Software Center to install software for non-root users?
Sentiment: 2.38% positive Score: 48 Answer Count: 5
2014-02-18 13:34 Title: What are some alternatives to upgrading without using the standard upgrade system?
Sentiment: neutral Score: 22 Answer Count: 2
Week 13 Total posts 1
2013-03-29 05:00 Title: How do I go back to KDE splash / login after installing XFCE?
Sentiment: 2.85% negative Score: 18 Answer Count: 4
Week 30 Total posts 1
2014-07-22 19:53 Title: How do I enable automatic updates?
Sentiment: neutral Score: 142 Answer Count: 5
Week 33 Total posts 1
2010-08-22 02:10 Title: How do I run a successful Ubuntu Hour?
Sentiment: 5.08% positive Score: 26 Answer Count: 6
Week 49 Total posts 1
2017-12-10 23:38 Title: How to graphically interface with a headless server?
Sentiment: 3.10% positive Score: 41 Answer Count: 9
Week 51 Total posts 1
2014-12-16 01:47 Title: How to get the Your battery is broken message to go away?
Sentiment: 11.11% negative Score: 61 Answer Count: 4
```
- [Actual Output](./output/5%20StackOverflow-Posts.txt)
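The week numbers in the example output are consistent with ISO-8601 week-of-year values (for instance, 2014-02-18 falls in ISO week 8). The sketch below shows how a post timestamp could be mapped to such a week number with `java.time`. The class and method names are hypothetical, and the use of ISO weeks is an inference from the sample output rather than something the project states.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.IsoFields;

// Hypothetical helper: derives the ISO-8601 week number used as a grouping key.
final class WeekOfYearSketch {

    // Matches timestamps as printed in the example output, e.g. "2014-02-18 13:33".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm");

    static int weekOf(String timestamp) {
        return LocalDateTime.parse(timestamp, FORMAT)
                .get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    public static void main(String[] args) {
        System.out.println("Week " + weekOf("2014-02-18 13:33")); // prints "Week 8"
    }
}
```

Note that grouping on the week number alone merges posts from different years into the same week, which matches how the example output mixes years.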
HighestRepUserCountJob.java
HighestRepUserMapper.java
HightestRepUserReducer.java
LocationWritable.java
Usage:
<Job.java> <input_file_location> <output_file_location>
Example Output :
Grouped by location, with the top 5 users per location ordered by reputation in descending order.
Algeria Total Users :1
[
User : Px (since 2010-10-14) reputation=0 upvotes=0 views=31 website=http://pixelmed.wordpress.com/]
Argentina Total Users :13
[
User : Ither (since 2010-11-30) reputation=52 upvotes=52 views=23 website=NA,
User : bruno077 (since 2010-10-03) reputation=35 upvotes=35 views=46 website=http://bonamin.org,
User : vicmp3 (since 2010-10-22) reputation=31 upvotes=31 views=91 website=NA,
User : Axel (since 2010-08-04) reputation=19 upvotes=19 views=47 website=http://localhost:8084,
User : Sebastián (since 2011-01-16) reputation=14 upvotes=14 views=26 website=NA]
.
.
.
.
Zimbabwe Total Users :1
[
User : coolmac (since 2011-05-09) reputation=0 upvotes=0 views=5 website=http://www.twitter.com/mukwenhac]
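The output above lists at most five users per location, highest reputation first. A reducer-side selection of that kind could look roughly like the sketch below. It is plain Java without Hadoop types so that it runs standalone; `TopUsersSketch`, `User`, and `topByReputation` are hypothetical names and do not reflect the project's `LocationWritable` or reducer code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: picks the highest-reputation users for one location.
final class TopUsersSketch {

    static final class User {
        final String name;
        final int reputation;

        User(String name, int reputation) {
            this.name = name;
            this.reputation = reputation;
        }
    }

    // Returns up to `limit` users sorted by reputation, highest first.
    // The sort is stable, so users with equal reputation keep their input order.
    static List<User> topByReputation(List<User> users, int limit) {
        List<User> sorted = new ArrayList<>(users);
        sorted.sort(Comparator.comparingInt((User u) -> u.reputation).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
                new User("vicmp3", 31), new User("Ither", 52), new User("bruno077", 35));
        for (User u : topByReputation(users, 5)) {
            System.out.println(u.name + " reputation=" + u.reputation);
        }
    }
}
```

Sorting the full list is simple but keeps every user of a location in memory; for very large groups a bounded priority queue would be the usual alternative.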
Analyzes the Comments file from the StackOverflow data to find the top 5 postIds, the sentiment of their comments, the users who made the comments, the comment scores, the dates on which the comments were made, and so on.
File path: /home/ayush/comments.xml
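The ranking criterion for the top 5 postIds is not spelled out here. The sketch below assumes posts are ranked by how many comments they received, and breaks ties by postId so the result is deterministic. `TopPostsSketch` and `topPosts` are hypothetical names, and the input is a plain list of postIds rather than parsed XML.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: ranks postIds by comment count (an assumed criterion).
final class TopPostsSketch {

    // Takes one postId per comment and returns the `limit` most commented posts.
    static List<Map.Entry<String, Integer>> topPosts(List<String> postIds, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String id : postIds) {
            counts.merge(id, 1, Integer::sum); // increment the comment count for this post
        }
        return counts.entrySet().stream()
                // Highest count first; equal counts are ordered by postId as a string.
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> postIds = Arrays.asList("10", "20", "10", "30", "20", "10");
        for (Map.Entry<String, Integer> e : topPosts(postIds, 5)) {
            System.out.println("postId=" + e.getKey() + " comments=" + e.getValue());
        }
    }
}
```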
Analyzes the Tags file from the StackOverflow data to find the top 5 tags along with their excerptPostId, wikiPostId, and related fields, and emits the result through the Reducer.
File path: /home/ayush/tags.xml
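Before any ranking, a mapper has to pull the fields out of each XML record. The sketch below assumes the usual Stack Exchange dump layout, in which every record is a single self-closing `<row ... />` element carrying its fields as attributes. `RowParserSketch`, `parseRow`, and the sample row values are hypothetical and only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

// Hypothetical helper: turns one <row ... /> line into an attribute map.
final class RowParserSketch {

    // Parses a single self-closing XML element and returns its attributes by name.
    static Map<String, String> parseRow(String line) {
        try {
            Element row = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(line.trim().getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NamedNodeMap attrs = row.getAttributes();
            Map<String, String> fields = new TreeMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                fields.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
            }
            return fields;
        } catch (Exception e) {
            // Malformed lines (for example the file's opening or closing wrapper tag) end up here.
            throw new IllegalArgumentException("Not a parsable row: " + line, e);
        }
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        String line = "<row Id=\"1\" TagName=\"ubuntu\" Count=\"5\" ExcerptPostId=\"11\" WikiPostId=\"12\" />";
        Map<String, String> fields = parseRow(line);
        System.out.println(fields.get("TagName") + " count=" + fields.get("Count"));
    }
}
```

Building a new parser per line is convenient for a sketch; a mapper processing millions of rows would more likely reuse one `DocumentBuilder` or a streaming parser.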