Project author: msramalho

Project description:
Big Data & Cloud Computing - PySpark, Dask, GCP, ...
Primary language: HTML
Project URL: git://github.com/msramalho/fcup-bdcc.git
Created: 2019-03-20T11:44:02Z
Project community: https://github.com/msramalho/fcup-bdcc

License:


Fcup-bdcc

Big Data & Cloud Computing

Part 1

Use Spark and Google Cloud Platform to analyse increasingly difficult samples of the MovieLens dataset.

Example of PySpark use:

    def recommendByTag(singleTag, TFIDF_tags, movies, min_fmax=10, numberOfResults=10, debug=False):
        # start by most complexity-reducing operation: filter
        # filter by the singleTag
        # remove entries with f_max < min_fmax
        df_tag = TFIDF_tags.filter(TFIDF_tags.tag == singleTag)\
            .filter(TFIDF_tags.f_max >= min_fmax)
        # join to get movie title
        # order by descending TFIDF + ascending lexicographic title
        # remove unnecessary columns
        # return results limited to numberOfResults
        df = df_tag.join(movies, 'movieId')\
            .orderBy(['TF_IDF', 'title'], ascending=[False, True])\
            .select('movieId', 'title', 'TF_IDF')\
            .limit(numberOfResults)
        return df
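
To make the ordering rule concrete without a Spark session, the same filter, join, sort and limit steps can be mirrored in plain Python. This is only an illustrative sketch: the helper name `recommend_by_tag_local` and the sample rows are hypothetical, and only the column names (`movieId`, `tag`, `f_max`, `TF_IDF`, `title`) come from the function above.

```python
def recommend_by_tag_local(single_tag, tfidf_rows, movies,
                           min_fmax=10, number_of_results=10):
    """Plain-Python mirror of recommendByTag (hypothetical helper, illustration only).

    tfidf_rows: list of dicts with keys movieId, tag, f_max, TF_IDF
    movies:     dict mapping movieId -> title
    """
    # filter by tag and drop entries with f_max < min_fmax
    kept = [r for r in tfidf_rows
            if r["tag"] == single_tag and r["f_max"] >= min_fmax]
    # inner join on movieId to pick up the title
    joined = [(r["movieId"], movies[r["movieId"]], r["TF_IDF"])
              for r in kept if r["movieId"] in movies]
    # descending TF_IDF, ties broken by ascending title
    joined.sort(key=lambda row: (-row[2], row[1]))
    return joined[:number_of_results]


# made-up sample data, not taken from MovieLens
tfidf_rows = [
    {"movieId": 1, "tag": "space", "f_max": 12, "TF_IDF": 0.5},
    {"movieId": 2, "tag": "space", "f_max": 15, "TF_IDF": 0.9},
    {"movieId": 3, "tag": "space", "f_max": 3,  "TF_IDF": 2.0},  # dropped: f_max < 10
    {"movieId": 4, "tag": "space", "f_max": 20, "TF_IDF": 0.5},
    {"movieId": 5, "tag": "heist", "f_max": 30, "TF_IDF": 1.5},  # dropped: other tag
]
movies = {1: "Moon", 2: "Alien", 3: "Gravity", 4: "Apollo 13", 5: "Heat"}

result = recommend_by_tag_local("space", tfidf_rows, movies)
for row in result:
    print(row)
```

Movie 3 has the highest TF_IDF in the sample but is excluded by the `min_fmax` threshold, and the two rows that tie on TF_IDF are ordered by title.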

Part 2

An open-ended problem: use Big Data tools and techniques to analyse a 32GB+ dataset of hospital events. Besides GCP, we used Dask and dask-ml.

Example Plot

Dask-Distributed Dashboard

Authors