Project author: MNoorFawi

Project description:
Simple text clustering using the k-means algorithm
Language: Python
Project URL: git://github.com/MNoorFawi/text-kmeans-clustering-with-python.git


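Text k-means clustering with Python

Simple text clustering using the k-means algorithm: documents are converted to TF-IDF vectors, grouped into clusters with scikit-learn's KMeans, and new documents are then assigned to the nearest cluster.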
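As background for the script, here is a minimal sketch of the two steps k-means alternates between: assigning each point to its nearest centroid, then moving each centroid to the mean of its points. This is an illustration only and not scikit-learn's implementation. The function name `kmeans`, the toy 2-D points, and the hand-picked starting centroids are all assumptions made for the example; the script instead uses `init='k-means++'` with `n_init=10` restarts.

```python
import math

def kmeans(points, centroids, max_iter=300):
    """Toy k-means loop (hypothetical helper): alternate assignment and update."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of toy points, one near the origin and one near (9, 9).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the script, the points are TF-IDF vectors with one dimension per vocabulary term rather than 2-D coordinates, but the same two steps apply.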

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    documents = ["the young french men crowned world champions",
                 "Google Translate app is getting more intelligent everyday",
                 "Facebook face recognition is driving me crazy",
                 "who is going to win the Golden Ball title this year",
                 "these camera apps are funny",
                 "Croacian team made a brilliant world cup campaign reaching the final match",
                 "Google Chrome extensions are useful.",
                 "Social Media apps leveraging AI incredibly",
                 "Qatar 2022 FIFA world cup is played in winter"]

    # convert the documents to TF-IDF vectors, dropping English stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    data = vectorizer.fit_transform(documents)

    # fit k-means with two clusters
    true_k = 2
    clustering_model = KMeans(n_clusters=true_k,
                              init='k-means++',
                              max_iter=300, n_init=10)
    clustering_model.fit(data)

    ## terms per cluster
    # sort each centroid's term indices by descending weight
    sorted_centroids = clustering_model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()

    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in sorted_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()
        print()
    print()

    # Cluster 0: apps google funny camera extensions useful chrome driving face facebook
    #
    # Cluster 1: world cup young champions crowned french men qatar fifa played

    ## predicting the cluster of new docs
    new_doc = ["how to install Chrome"]
    Y = vectorizer.transform(new_doc)
    prediction = clustering_model.predict(Y)
    print(prediction)
    # [0]

    new_doc = ["UCL Final match is played in Madrid this year"]
    Y = vectorizer.transform(new_doc)
    prediction = clustering_model.predict(Y)
    print(prediction)
    # [1]
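
The script prints the top terms per cluster but not which cluster each of the original documents landed in. A minimal sketch, assuming the same documents and settings, reads this from the fitted model's `labels_` attribute. The variable name `model` and the `random_state=0` argument are additions for illustration and are not part of the original script; the seed only makes repeated runs give the same assignment.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the young french men crowned world champions",
    "Google Translate app is getting more intelligent everyday",
    "Facebook face recognition is driving me crazy",
    "who is going to win the Golden Ball title this year",
    "these camera apps are funny",
    "Croacian team made a brilliant world cup campaign reaching the final match",
    "Google Chrome extensions are useful.",
    "Social Media apps leveraging AI incredibly",
    "Qatar 2022 FIFA world cup is played in winter",
]

vectorizer = TfidfVectorizer(stop_words="english")
data = vectorizer.fit_transform(documents)

# random_state is an assumption added for repeatable runs; the original leaves it unset.
model = KMeans(n_clusters=2, init="k-means++", max_iter=300, n_init=10, random_state=0)
model.fit(data)

# labels_ holds one cluster index per training document, in input order.
labels = model.labels_
for label, doc in zip(labels, documents):
    print(label, doc)
```

Cluster indices are arbitrary: without a fixed seed, the groups numbered 0 and 1 may swap between runs, so the `# [0]` and `# [1]` predictions in the script should be read relative to the terms printed for each cluster in that same run.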