Build a Recommendation System that suggests businesses to users based on the location, stars, category, reviews and business hours.
The JSON files have been converted into CSV files using the sample code provided by Yelp at https://github.com/Yelp/dataset-examples. The converted CSV files can't be uploaded to GitHub because of their large size.
Converting all files from JSON to CSV for easier reading and to utilize the existing resources. [ref: https://github.com/Yelp/dataset-examples]
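As a rough illustration of the conversion step, here is a minimal sketch in plain Python. It is not the Yelp converter script referenced above. It assumes each input file holds one JSON object per line, and the function name `json_lines_to_csv` is made up for this example. Nested values (lists or dictionaries) are simply written in their string form.

```python
import csv
import json


def json_lines_to_csv(json_path, csv_path):
    """Convert a file with one JSON object per line into a CSV file."""
    with open(json_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]

    # Use the union of all keys, in first-seen order, as the CSV header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
    return len(records)
```

The sketch reads the whole file into memory, which keeps it short but may not suit the largest files; a streaming version would need the header to be known in advance.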
Loading business data. Dropping variables ['type', 'postal_code', 'latitude', 'longitude', 'review_count', 'attributes', 'address', 'is_open', 'hours', 'stars'].
Filtering the data to contain only state 'NC'.
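The business preparation could be sketched as follows. This uses plain Python dictionaries (for example, rows read with `csv.DictReader`) rather than the Spark code in the notebook, and the names `DROPPED` and `prepare_businesses` are hypothetical.

```python
# Columns removed from the business data, as listed above.
DROPPED = {
    "type", "postal_code", "latitude", "longitude", "review_count",
    "attributes", "address", "is_open", "hours", "stars",
}


def prepare_businesses(rows, state="NC"):
    """Keep only businesses in `state` and remove the dropped columns."""
    return [
        {key: value for key, value in row.items() if key not in DROPPED}
        for row in rows
        if row.get("state") == state
    ]
```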
Loading review data. Keeping only business id, user id, stars, and the funny, cool, and useful votes.
Merging both datasets by business id.
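A minimal sketch of the review selection and merge, again in plain Python rather than Spark. The column names follow the list above. Treating the merge as an inner join (reviews of businesses that were filtered out are discarded) is an assumption, and `merge_reviews` is a hypothetical name.

```python
# Review columns that are kept; everything else is discarded.
REVIEW_COLUMNS = ("business_id", "user_id", "stars", "funny", "cool", "useful")


def merge_reviews(businesses, reviews):
    """Inner-join reviews to businesses on business_id."""
    by_id = {business["business_id"]: business for business in businesses}
    merged = []
    for review in reviews:
        business = by_id.get(review["business_id"])
        if business is None:
            continue  # the review's business was filtered out earlier
        kept = {column: review[column] for column in REVIEW_COLUMNS}
        merged.append({**business, **kept})
    return merged
```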
Updating the value of the rating 'stars' with the following formula (which definitely introduces bias):
if stars >= 3
    stars = stars + (0.1*funny + 0.6*useful + 0.3*cool) / (funny + useful + cool + 1)
else
    stars = stars - (0.1*funny + 0.6*useful + 0.3*cool) / (funny + useful + cool + 1)
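The formula above can be written as a small Python function to check individual values. This is only a sketch; the name `adjust_stars` is hypothetical, and the inputs are assumed to be numbers already (values read from CSV would need converting first).

```python
def adjust_stars(stars, funny, useful, cool):
    """Shift a star rating by a weighted share of the review's votes."""
    weighted = 0.1 * funny + 0.6 * useful + 0.3 * cool
    # The +1 in the denominator avoids division by zero for reviews with no votes.
    shift = weighted / (funny + useful + cool + 1)
    # Ratings of 3 or more are pushed up, lower ratings are pushed down.
    return stars + shift if stars >= 3 else stars - shift
```

For a 4-star review with one vote of each kind, the weighted sum is 1.0 and the denominator is 4, so the rating becomes about 4.25; the same votes on a 2-star review give about 1.75. The shift is always below 0.6, but it can still move a rating above 5 or below 1.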
|_ Work: where all code resides.
|_ Documents: All the documents provided by the professor.
|_ Sample_Data: The filtered data files and sample initial files.
├── Work\
| ├── ReadData.py
| ├── business_recommender.ipynb
├── Documents\
| ├── Business Questions.txt
| ├── FinalProjectProblemsProposal.pdf
├── Sample_Data\
| ├── business.txt
| ├── checkin.txt
| ├── merged_BR3.csv
| ├── tip.txt
Work Folder Details:
IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark
Once the notebook is open, the code can be executed using the Run button.