Update: Anaconda released a patch! If you re-install NumPy 1.15.2 (on Linux at least), you should notice that your build number goes from 1.15.2-py37h1d66e8a_0 → 1.15.2-py37h1d66e8a_1. Those builds aren’t available on PyPI. NumPy 1.15.3 also contains a fix. I don’t recommend using older versions of NumPy with Python 3.7.
There is what I consider to be a critical bug right now for data science workflows on Python 3.7. The bug is not in Python, but in NumPy, where casting errors are sometimes swallowed. This means you can do things like cast strings to complex numbers and NumPy might not throw an exception. Pandas relies on NumPy’s error handling to determine whether your dataframe is all numeric, so taking the mean of a dataframe can sometimes give you your results as complex numbers. Everyone should stay on Python 3.6 until the NumPy fix is released. The relevant GitHub issues are NumPy #11993 and #12062, and Pandas #22506 and #22753. You can circumvent the issue by passing numeric_only=True into your call to .mean, but it is unlikely that you are already doing so. The fix to NumPy has been merged; once it’s released you should upgrade.
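For instance, with a small mixed-type DataFrame (toy data, just for illustration), the workaround looks like this:

import pandas as pd

df = pd.DataFrame({"user": ["A", "B"], "connections": [3.0, 4970.0]})

# numeric_only=True makes pandas skip the object-dtype "user" column,
# so the broken object-to-numeric cast in NumPy is never attempted.
print(df.mean(numeric_only=True))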
If we look at the code in NumPy for converting objects into other types:
static void
OBJECT_to_@TOTYPE@(void *input, void *output, npy_intp n,
                   void *NPY_UNUSED(aip), void *aop)
{
    PyObject **ip = input;
    @totype@ *op = output;

    npy_intp i;
    int skip = @skip@;

    for (i = 0; i < n; i++, ip++, op += skip) {
        if (*ip == NULL) {
            @TOTYPE@_setitem(Py_False, op, aop);
        }
        else {
            @TOTYPE@_setitem(*ip, op, aop);
        }
    }
}
This code does not stop when a call to @TOTYPE@_setitem fails. @ahaldane discovered the problem and fixed the loop to bail out on error:
static void
OBJECT_to_@TOTYPE@(void *input, void *output, npy_intp n,
                   void *NPY_UNUSED(aip), void *aop)
{
    PyObject **ip = input;
    @totype@ *op = output;

    npy_intp i;
    int skip = @skip@;

    for (i = 0; i < n; i++, ip++, op += skip) {
        if (*ip == NULL) {
            if (@TOTYPE@_setitem(Py_False, op, aop) < 0) {
                return;
            }
        }
        else {
            if (@TOTYPE@_setitem(*ip, op, aop) < 0) {
                return;
            }
        }
    }
}
Without quitting the loop, subsequent calls probably invoke some CPython code that was changed in 3.7 to call PyErr_Clear, which wipes out the pending casting error. By the way, if that code looks strange to you, it’s because NumPy uses its own template engine: @TOTYPE@ and @totype@ are expanded into concrete type names when the source is generated.
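You can see the swallowed error directly, without pandas in the picture. This is a minimal illustration; the exact exception type on a fixed build may differ:

import numpy as np

arr = np.array(['A', 'B'], dtype=object)

try:
    # On an affected Python 3.7 + NumPy build this can silently return
    # garbage complex values; on a fixed NumPy the failed cast raises.
    print(arr.astype(np.complex128))
except Exception as exc:
    print('cast failed as expected:', exc)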
This can certainly have more impact than what I’m describing here, but the most immediate impact is that aggregating DataFrames with mixed types sometimes produces complex results. To illustrate how unpredictable this problem is, try the following example:
import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "A", "A", "A"],
    "connections": [3.0, 4970.0, 4749.0, 4719.0, 4704.0],
})
df['connections2'] = df.connections.astype('int64')

print()
print('usually incorrect')
print()
print(df.mean())
print()
print(df.head())
print()
print('usually correct')
print()
print(df.mean())
I consistently get some output that looks like this:
usually incorrect
user            (1.38443408503753e-310+1.38443408513886e-310j)
connections       (1.3844303826283e-310+1.3844336097506e-310j)
connections2                                          (3829+0j)
dtype: complex128
  user  connections  connections2
0    A          3.0             3
1    A       4970.0          4970
2    A       4749.0          4749
3    A       4719.0          4719
4    A       4704.0          4704
usually correct
connections     3829.0
connections2    3829.0
dtype: float64
To illustrate the unpredictability further: if I take out the call to print(df.head()), then I get complex results almost every single time. If I put the initialization of connections2 into the DataFrame constructor, I almost always get back floating point results; however, sometimes I incorrectly get a mean of "user" as 0.0.
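That variant looks something like this (same data, with connections2 built inside the constructor rather than assigned afterwards):

import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "A", "A", "A"],
    "connections": [3.0, 4970.0, 4749.0, 4719.0, 4704.0],
    "connections2": [3, 4970, 4749, 4719, 4704],
})
print(df.mean())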
The reason this happens is that the _reduce method relies on exceptions occurring when the reduction function is applied to .values. If that fails, Pandas attempts to extract only the numeric columns and then applies the function again. There are two attempted invocations of the function (here and here) before we actually get to the pure numeric data.
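Here is a simplified sketch of that fallback pattern (not pandas’ actual _reduce implementation, just the shape of it) to show why a swallowed cast error matters:

import numpy as np
import pandas as pd

def sketch_reduce(df, func=np.mean):
    # Simplified sketch, not the real pandas _reduce: try the raw values
    # first, and only fall back to the numeric columns if the cast raises.
    try:
        # With the NumPy bug, this cast can "succeed" with garbage values,
        # so the exception-based fallback below is never triggered.
        return func(df.values.astype('float64'), axis=0)
    except (TypeError, ValueError):
        numeric = df.select_dtypes(include='number')
        return func(numeric.values, axis=0)

df = pd.DataFrame({"user": ["A", "A"], "connections": [3.0, 4970.0]})
print(sketch_reduce(df))  # [2486.5] on a correct NumPy

On a correct NumPy the astype call raises and we fall through to the numeric-only branch; on an affected build it can return uninitialized numbers instead, and the reduction happily aggregates garbage.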
Stay on Python 3.6 for now. When NumPy releases the fix, upgrade to that version.
We are Saturn Cloud and we provide cloud hosted Jupyter notebooks. If you want Jupyter managed for your team, or to take a class, check us out!
Originally published at www.opensourceanswers.com.