A lightweight framework for collaborative, open-source data science
projects through feature engineering.
While the open-source model for software development has led to successful, large-scale collaborations in building software applications, chess engines, and scientific analyses, data science has not benefited from this development paradigm. In part, this is due to the divide between the development processes used by software engineers and those used by data scientists.
Ballet tries to address this disparity. It is a lightweight software framework that supports collaborative data science development by composing a data science pipeline from a collection of modular patches that can be written in parallel. Ballet provides the underlying functionality to support interactive development, test and merge high-quality contributions, and compose the accepted contributions into a single product.
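The idea of composing a pipeline from independently written patches can be illustrated with a small, self-contained sketch. This is not Ballet's API: `FeaturePatch`, `is_acceptable`, and `compose` are hypothetical names invented for illustration, and the acceptance check here is a toy stand-in (each patch must run on sample rows and return finite numbers) for whatever validation a real project would apply.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


@dataclass(frozen=True)
class FeaturePatch:
    """Hypothetical stand-in for one contributor's modular patch."""
    name: str
    inputs: Tuple[str, ...]
    transform: Callable[..., float]

    def apply(self, row: Row) -> float:
        # Pass the named input columns to the contributor's function.
        return self.transform(*(row[col] for col in self.inputs))


def is_acceptable(patch: FeaturePatch, sample_rows: List[Row]) -> bool:
    """Toy acceptance check: the patch must run on every sample row
    and produce a finite number each time."""
    try:
        values = [patch.apply(row) for row in sample_rows]
    except Exception:
        return False
    return all(isinstance(v, (int, float)) and math.isfinite(v) for v in values)


def compose(patches: List[FeaturePatch], sample_rows: List[Row]) -> Callable[[Row], Row]:
    """Keep only the accepted patches and combine them into one pipeline."""
    accepted = [p for p in patches if is_acceptable(p, sample_rows)]

    def pipeline(row: Row) -> Row:
        return {p.name: p.apply(row) for p in accepted}

    return pipeline


# Three patches written independently; one of them fails on the sample data.
rows = [{"age": 30.0, "hours": 40.0}, {"age": 50.0, "hours": 0.0}]
patches = [
    FeaturePatch("age_squared", ("age",), lambda a: a * a),
    FeaturePatch("age_per_hour", ("age", "hours"), lambda a, h: a / h),  # divides by zero on row 2
    FeaturePatch("hours_share", ("hours",), lambda h: h / 168.0),
]
pipeline = compose(patches, rows)
features = pipeline(rows[0])
```

In this sketch the `age_per_hour` patch is rejected because it raises on the second sample row, so the composed pipeline contains only the two remaining features. The point is the shape of the workflow (write patches in parallel, gate each one, merge the survivors), not the specific check.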
We have deployed Ballet for feature engineering collaborations on tabular survey datasets of public interest. For example, predict-census-income is a large real-world collaborative project to engineer features from raw individual survey responses to the U.S. Census American Community Survey (ACS) and predict personal income. The resulting project is one of the largest data science collaborations on GitHub and outperforms state-of-the-art tabular AutoML systems and independent data science experts.
Ballet includes several different pieces for enabling collaborative data science.
- ballet.feature
- ballet.pipeline
- ballet.transformer, ballet.eng
- ballet.validation
- ballet.contrib
- ballet/templates/project_template, ballet.update
- ballet.cli
- ballet.project
- ballet.client
Are you a data owner or project maintainer that wants to organize a
collaboration?
👉 Check out the Ballet Maintainer Guide
Are you a data scientist or enthusiast that wants to join a collaboration?
👉 Check out the Ballet Contributor Guide
Do you want to learn about how Ballet enables Better Feature Engineering™️?
👉 Check out the Feature Engineering Guide
You can also read our research paper about the Ballet framework and our case study analysis, which appeared at ACM CSCW 2021:
👉 Enabling Collaborative Data Science Development with the Ballet Framework
The Ballet GitHub organization hosts several ongoing Ballet collaborations.
If you use Ballet in your work, please consider citing the following paper:
@article{smith2021enabling,
author = {Smith, Micah J. and Cito, J{\"u}rgen and Lu, Kelvin and Veeramachaneni, Kalyan},
title = "Enabling Collaborative Data Science Development with the {Ballet} Framework",
year = "2021",
month = "October",
volume = "5",
pages = "1--39",
doi = "10.1145/3479575",
journal = "Proceedings of the {ACM} on Human-Computer Interaction",
publisher = "{ACM}",
language = "en",
number = "CSCW2"
}