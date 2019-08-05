
Free & Open Source Advocate. Data Geek - Big or Small.
Python/R scripts
Data sets
Reference materials - includes journal articles, slides, other documents
Notebooks
Notes
Scala sources (if using Spark)
Cloned repositories of other projects relevant to the current work - usually a source of inspiration, methodology or case studies
Other scripts - for data transfer, data clean-up, or even a runner.sh to submit jobs on a cluster. I always have a runner.sh that contains the YARN settings for spark-submit
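As a rough illustration of what such a runner might hold, here is a minimal Python sketch that assembles a spark-submit command with YARN settings. The script path, queue name and resource values are placeholders chosen for the example, not the author's actual settings; the flags themselves are standard spark-submit options.

```python
import shlex


def build_spark_submit(app, queue="default", num_executors=4,
                       executor_memory="4g", executor_cores=2):
    """Return a spark-submit command (as a list) targeting a YARN cluster.

    All default values are illustrative assumptions; tune them per cluster.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--queue", queue,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app,
    ]


# Hypothetical job script, placed under the repo layout for illustration only.
cmd = build_spark_submit("repo/src/python/main/job.py", queue="analytics")

# Print the command instead of running it, so it can be inspected first.
print(shlex.join(cmd))
```

Keeping the settings in one function (or one runner.sh) means every job is submitted with the same cluster configuration, and changes are made in a single place.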
Requirements gathering
ETL data from sources using python, R or scala
Data calibration - perform descriptive statistics on the data to validate whether it reflects business facts. This takes some time, even in a collaborative environment where business and data scientists work closely together. In addition, data calibration is needed to further verify business facts.
Data Science and Insights Generation - with the data validated and calibrated, a Data Scientist can now start working on generating insights, producing notebooks, scripts or Scala jars. Notes, journal articles and other references will add to the clutter in the working directory.
Visualization and Reports creation - reports for business are consolidated in a presentation from the outputs of various visualization tools (png files, Tableau workbooks)
PySpark or Spark jobs sources for operationalization - if the study is to be operationalized, prototypes are built as a guide for Data Engineers.
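The validation step in the workflow above can be sketched in a few lines of Python. This is only an illustration using the standard library: the sample figures and the expected range are made up, and the helper names describe and check_business_fact are hypothetical.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
    }


def check_business_fact(summary, stat, low, high):
    """True if the chosen statistic falls inside the range the business expects."""
    return low <= summary[stat] <= high


# Made-up sample: daily order counts extracted from a source system.
daily_orders = [120, 135, 128, 142, 131, 125, 138]
summary = describe(daily_orders)

# Assumed business fact: "we handle between 100 and 150 orders a day on average".
ok = check_business_fact(summary, "mean", 100, 150)
print(summary, ok)
```

When a check like this fails, it is a prompt to revisit the extraction with the business side before any modelling starts.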
ansible-playbooks: Ansible playbooks are created to automate repetitive tasks
data: all data sets (toy, final, intermediate aggregates, etc.). I usually have two subdirectories, for (1) datasets generated in the cluster (we're running in a Spark environment) and (2) locally generated datasets
Notebooks: with subdirectories for notebooks running on the cluster and locally
References: PDFs, journal articles, references
repo: for all Python, Scala and R scripts, organized as repo/src/python/main/R, repo/src/python/lib (for various utilities), repo/src/main (for Scala code). repo is organized like this to allow easy compilation of Scala code using a Maven build.
Reports: all reports go here
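A layout like the one described above can be created in one go with a short script. This is a minimal sketch: the top-level names and the repo paths follow the text, while the cluster and local subdirectory names are assumptions, since the article does not name them.

```python
from pathlib import Path
import tempfile

# Directory layout from the text; "cluster" and "local" are assumed names.
LAYOUT = [
    "ansible-playbooks",
    "data/cluster",
    "data/local",
    "Notebooks/cluster",
    "Notebooks/local",
    "References",
    "repo/src/python/main/R",
    "repo/src/python/lib",
    "repo/src/main",
    "Reports",
]


def create_skeleton(root):
    """Create the project directories under root and return their paths."""
    root = Path(root)
    created = []
    for rel in LAYOUT:
        path = root / rel
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_skeleton(tmp):
        print(path.relative_to(tmp))
```

To use it on a real project, call create_skeleton with the project's root folder instead of the temporary directory.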
/*
**/.DS_Store
**/.ipynb_checkpoints
**/*.log
repo/src/python/lib/
!/resources
!/notebooks
!/repo
!/ansible
!/data
!/.gitignore