Rick Lamers

How I built a spreadsheet app with Python to make data science easier

Today I'm open sourcing "Grid studio", a web-based spreadsheet application with full integration of the Python programming language.
About a year ago I started tinkering with the idea of building the data science IDE that I had always wanted. Having worked extensively with Microsoft Excel, R (Studio) and Python, I envisioned how some integrated version of those would make my life easier.

Why?

The main issue that I set out to solve with Grid studio is the scattered workflow that I was experiencing while going back and forth between multiple tools such as R studio and Excel while working on data science projects.
While exporting that CSV file for the gazillionth-time, running into freezing up of application windows when my row count was too high or trying to do something straightforward such as reading in a JSON file I had had enough. The existing tools did not provide me with the environment and associated workflow that enabled me to be productive.
That's why I decided to build something that would integrate my workflow into a single, modern and easy to use application fit for my data science needs.

How does it work?

Grid studio is a web-based application that looks remarkably similar to an ordinary spreadsheet program such as Google Sheets or Microsoft Excel. However, its killer feature is the deep integration of the Python language.
Viewing data in a tabular structure and manipulating it directly feels naturally to almost everybody who has used a computer.
Combining this simple UI with the power of a full fledged programming language such as Python really makes it stand out.
Scripting with Python is as straightforward as possible: just write a few lines and execute it directly.

Core integration: Reading and Writing to sheets

At the core of the Python integration is the read and write interface to your spreadsheet. A high performance connection between your sheets' data and data in your Python process.
Simply write to the sheet like so:
sheet("A1:A3", [1, 2, 3])
And read from the sheet like this:
my_matrix = sheet("A1:A3")
With this simple yet powerful function you can read and write directly from and to sheets in order to automate data entry, extraction, visualization and more.

Writing custom spreadsheet functions

While reading and writing gives you a lot of flexibility through a simple interface, sometimes it makes a lot of sense to write custom functions that can be called directly in your spreadsheet.
Common spreadsheet functions such as AVERAGE, SUM, IF, etc. are already available by default. But what if you need more?
Simply write the function you need!
def UPPERCASE(a):
    return str(a).uppercase()
Now call this function in your spreadsheet, just like you would a regular function.

Leveraging Python ecosystem

By leveraging the power of the Python ecosystem you get instant access to state of the art data science tools:
This enables simple access to powerful models, such as a linear regression and SVMs for modelling your data.

Docker runtime

The application runs in a Docker container which gives you easy access to a fully packed and isolated UNIX environment (even on Windows!) with everything ready to go: Python, scikit-learn, numpa, pandas, terminal, wget, zip, and much more.
This makes installing Grid studio as simple as downloading the prebuilt docker image and running a single command.

Data visualization

A common task in data science is visualizing your data. Given its importance Grid studio has built in support for advanced plotting by integrating interactive plotting library Plotly.js and Python's standard Matplotlib. This provides you with advanced plotting capabilities in vector sharp format.
To give you some ideas about how to use Grid studio's features we show how they can combined with some concrete examples.

Example: Scraping the web

This example shows you the power of having Python at your fingertips. Something that typically requires some back- and forth between tools and files can now be integrated into a single script.
Above, you see how a short script easily loads news articles from Hacker News directly into the sheet.
Source: scrape.py

Example: Estimating a normal distribution

This example shows a somewhat silly use case of estimating the normal distribution with ever higher fidelity visualized with Plotly.js. Here you can see how interactive plotting can give you a sense of what is going on.

How can I use it?

Installing Grid studio locally is very simple:
(Make sure you have Docker installed)

1. Clone the repository with this command:

git clone https://github.com/ricklamers/gridstudio

2. Run the bash script (on Windows use e.g. Git Bash) with this command:

cd gridstudio && ./run.sh

3. Go to http://127.0.0.1:8080 in your browser
Git Bash for Windows - Docker install
Note: if you run into issues feel free to open a issue on GitHub, I'll try to assist/fix as quickly as possible.
Note: on Linux you might need to run (for step 2):
cd gridstudio && sudo ./run.sh
as Docker requires sudo access to run.

Release + Future development

As mentioned in the introduction, today Grid studio will be available for free and open source through the GitHub repository.
If you made it here you are very welcome to try it out yourself and submit any feedback and/or contributions to the project on GitHub.
I have some ideas about which functionality or features could be added in the future to improve Grid studio. However, since the project is now open source I think it would be wise to track these on GitHub and see which have the highest priority according to all involved.
- Expanding the number of 'native' functions available in the spreadsheet (like AVERAGE, SUM, IF, etc), maybe even reaching parity (and consistency) with some existing packages like Libre Office's Calc or Excel
- Syntax highlight/function tooltips when typing formulas in the spreadsheet
- Advanced sorting and filtering in the spreadsheet
- Extended controls for interactive plotting with Plotly.js
- Sharing of workspaces/code more easily (i.e. export workspace)
- Forms of real-time collaboration (this might be too difficult)
- Some sort of API/interface for add-ons/extensions
- Upgrade formula parser to a real grammar based parser
- Python autocompletion
- Performance optimization
- Core Python/sheet integration robustness (no character/sequence breakage)

Open source motivation

While this project was originally intented for commercial release, I've decided it might be better off as an open source project for everyone to experiment with and potentially be developed by a small community of interested data enthousiasts.
The reason for this is that during the initial development of the project I've discovered a number of projects that offer similar functionality to Grid studio.
First, there is an open source plug-in that integrates Python directly into Microsoft Excel called xlwings. Although it does not really integrate spreadsheets and Python into a single coherent product, it does offer the advantage of giving users access to the 'real' fully loaded Excel environment they are already familiar with.
Second, Python has evolved from IPython to Jupyter Notebooks to JupyterLab. It enjoys a lot of popularity and rightly so, it offers a very nice work environment for data scientists with a strong emphasis on explainable code through long form Notebooks. Although, it lacks any kind of spreadsheet functionality that, in my opinion, is so appealing to novice data scientists because of its intuitive behaviour.
Overall, projects like these meant that commercializing Grid studio would mean competing with these product substitutes that are frankly available for the incredibly low price of free.
Regardless, I sincerely believe that Grid studio does have something unique to offer over existing alternatives and could be the tool of choice for quite a few use cases.
Grid studio is available for free and open source through theĀ GitHub repository.

Tags

Comments

July 26th, 2019

This is awesome, thanks for building it! Have you thought about turning it into a SaaS service?

July 26th, 2019

Small spelling correction: “… is so appealing to novice data scientists because of its untuitive behaviour.” should probably be “… is so appealing to novice data scientists because of its intuitive behaviour.”.

July 26th, 2019

Hi voytekg,

Thank you for the compliment. I have thought about turning it into SaaS, I’m not sure if there’s enough demand for it. At the moment I would like to develop it through open-source collaboration with a community of adopters. Maybe in the future it makes sense to offer a SaaS platform around it.

Cheers!

July 26th, 2019

Good catch! I’ve edited it, but it doesn’t appear to propagate to the live version. Opened a post about it with the editor.

July 27th, 2019

(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)

July 27th, 2019

Thanks a bunch!

I’m a bit surprised that you find the code base clean, I personally had some issues with it. But I didn’t want to wait too long with shipping :smile: .

I did consider multiple language backends, but I felt that it would become too much for me to manage. It would require a better interface between the Go spreadsheet part and the language of choice, plus a whole lot more testing on my end.

Now that it’s in the open source realm I wouldn’t mind supporting more languages if someone has a good use case and a pull request :smiley:

July 27th, 2019

No way! Just yesterday I found myself using python and Excel and R to do some of the calculations and plotting. This is great! I would love to help you out continuing this project and I will!

July 27th, 2019

Rick, you are a genius!
I am not data sciencist, just a guy who like to play with data but is not a master of spreadsheets and python, still know some of these and love both. So, Grid Studio looks like specially made for me :stuck_out_tongue_winking_eye:

July 28th, 2019

Thanks maciek! I hope it ends up being for you!

July 28th, 2019

Quick question – do you do any sandboxing of the Python scripts? Wondering if I use this in a shared environment (e.g., with my colleagues/employees) whether they can access the general OS, etc. via Python.

July 28th, 2019

There are two options currently: run it on your local computer or use the SaaS option that spawns your personal virtual machine on the DigitalOcean cloud environment (https://dashboard.gridstudio.io). In both cases your workspace Python session is isolated to the Docker container environment.

Does that make sense?

August 3rd, 2019

Second, Python has evolved from IPython to Jupyter Notebooks to JupyterLab. It enjoys a lot of popularity and rightly so, it offers a very nice work environment for data scientists with a strong emphasis on explainable code through long form Notebooks. Although, it lacks any kind of spreadsheet functionality that, in my opinion, is so appealing to novice data scientists because of its intuitive behaviour.

QGrid is one way to work with DataFrames as spreadsheets in Jupyter Notebook and JupyterLab

re: Jupyter and 2d projections of data
https://news.ycombinator.com/item?id=19087358

August 3rd, 2019

QGrid looks really interesting. I’m going to try it out for sure.

Topics of interest