v0.20.1 (May 5, 2017)

This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

  • New .agg() API for Series/DataFrame similar to the groupby-rolling-resample APIs, see here
  • Integration with the feather-format, including a new top-level pd.read_feather() function and DataFrame.to_feather() method, see here.
  • The .ix indexer has been deprecated, see here
  • Panel has been deprecated, see here
  • Addition of an IntervalIndex and Interval scalar type, see here
  • Improved user API when grouping by index levels in .groupby(), see here
  • Improved support for UInt64 dtypes, see here
  • A new orient for JSON serialization, orient='table', that uses the Table Schema spec and that gives the possibility for a more interactive repr in the Jupyter Notebook, see here
  • Experimental support for exporting styled DataFrames (DataFrame.style) to Excel, see here
  • Window binary corr/cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here
  • Support for S3 handling now uses s3fs, see here
  • Google BigQuery support now uses the pandas-gbq library, see here

Warning

Pandas has changed the internal structure and layout of the code base. This can affect imports that are not from the top-level pandas.* namespace; please see the changes here.

Check the API Changes and deprecations before updating.

Note

This is a combined release for 0.20.0 and 0.20.1. Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas’ utils routines. (GH16250)

What’s new in v0.20.0

New features

agg API for DataFrame/Series

Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from groupby, window operations, and resampling. This allows aggregation operations in a concise way by using agg() and transform(). The full documentation is here (GH1623).

Here is a sample

In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   ...:                   index=pd.date_range('1/1/2000', periods=10))
   ...: 

In [2]: df.iloc[3:7] = np.nan

In [3]: df
Out[3]: 
                   A         B         C
2000-01-01  0.469112 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215
2000-01-03  0.119209 -1.044236 -0.861849
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.113648 -1.478427  0.524988
2000-01-09  0.404705  0.577046 -1.715002
2000-01-10 -1.039268 -0.370647 -1.157892

[10 rows x 3 columns]

One can operate using string function names, callables, lists, or dictionaries of these.

Using a single function is equivalent to .apply.

In [4]: df.agg('sum')
Out[4]: 
A   -1.068226
B   -1.387015
C   -4.892029
Length: 3, dtype: float64

Multiple aggregations with a list of functions.

In [5]: df.agg(['sum', 'min'])
Out[5]: 
            A         B         C
sum -1.068226 -1.387015 -4.892029
min -1.135632 -1.478427 -1.715002

[2 rows x 3 columns]

Using a dict provides the ability to apply specific aggregations per column. You will get a matrix-like output of all of the aggregators, with one row per unique function. Functions that are not applied to a particular column will be NaN in that column:

In [6]: df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})
Out[6]: 
            A         B
max       NaN  1.212112
min -1.135632 -1.478427
sum -1.068226       NaN

[3 rows x 2 columns]

The API also supports a .transform() function for broadcasting results.

In [7]: df.transform(['abs', lambda x: x - x.min()])
Out[7]: 
                   A                   B                   C          
                 abs  <lambda>       abs  <lambda>       abs  <lambda>
2000-01-01  0.469112  1.604745  0.282863  1.195563  1.509059  0.205944
2000-01-02  1.135632  0.000000  1.212112  2.690539  0.173215  1.541787
2000-01-03  0.119209  1.254841  1.044236  0.434191  0.861849  0.853153
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.113648  1.249281  1.478427  0.000000  0.524988  2.239990
2000-01-09  0.404705  1.540338  0.577046  2.055473  1.715002  0.000000
2000-01-10  1.039268  0.096364  0.370647  1.107780  1.157892  0.557110

[10 rows x 6 columns]

When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations. This is similar to how groupby .agg() works. (GH15015)

In [8]: df = pd.DataFrame({'A': [1, 2, 3],
   ...:                    'B': [1., 2., 3.],
   ...:                    'C': ['foo', 'bar', 'baz'],
   ...:                    'D': pd.date_range('20130101', periods=3)})
   ...: 

In [9]: df.dtypes
Out[9]: 
A             int64
B           float64
C            object
D    datetime64[ns]
Length: 4, dtype: object

In [10]: df.agg(['min', 'sum'])
Out[10]: 
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

[2 rows x 4 columns]

dtype keyword for data IO

The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.

In [11]: data = "a  b\n1  2\n3  4"

In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]: 
a    int64
b    int64
Length: 2, dtype: object

In [13]: pd.read_fwf(StringIO(data), dtype={'a': 'float64', 'b': 'object'}).dtypes
Out[13]: 
a    float64
b     object
Length: 2, dtype: object

.to_datetime() has gained an origin parameter

to_datetime() has gained a new parameter, origin, to define a reference date from which to compute the resulting timestamps when parsing numerical values with a specific unit specified. (GH11276, GH11745)

For example, with 1960-01-01 as the starting date:

In [14]: pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
Out[14]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)

The default is origin='unix', which defaults to 1970-01-01 00:00:00, commonly called the ‘unix epoch’ or POSIX time. This was the previous default, so this is a backward-compatible change.

In [15]: pd.to_datetime([1, 2, 3], unit='D')
Out[15]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)

Groupby enhancements

Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names. Previously, only column names could be referenced. This makes it easy to group by a column and an index level at the same time. (GH5677)

In [16]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [17]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [18]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
   ....:                    'B': np.arange(8)},
   ....:                   index=index)
   ....: 

In [19]: df
Out[19]: 
              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

[8 rows x 2 columns]

In [20]: df.groupby(['second', 'A']).sum()
Out[20]: 
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

[6 rows x 1 columns]

Better support for compressed URLs in read_csv

The compression code was refactored (GH12688). As a result, reading dataframes from URLs in read_csv() or read_table() now supports additional compression methods: xz, bz2, and zip (GH14570). Previously, only gzip compression was supported. By default, compression of URLs and paths is now inferred from their file extensions. Additionally, support for bz2 compression in the Python 2 C-engine was improved (GH14874).

In [21]: url = ('https://github.com/{repo}/raw/{branch}/{path}'
   ....:        .format(repo='pandas-dev/pandas',
   ....:                branch='master',
   ....:                path='pandas/tests/io/parser/data/salaries.csv.bz2'))
   ....: 

# default, infer compression
In [22]: df = pd.read_csv(url, sep='\t', compression='infer')

# explicitly specify compression
In [23]: df = pd.read_csv(url, sep='\t', compression='bz2')

In [24]: df.head(2)
Out[24]: 
       S  X  E  M
0  13876  1  1  1
1  11608  1  3  0

[2 rows x 4 columns]

Pickle file I/O now supports compression

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can now read from and write to compressed pickle files. Compression methods can be an explicit parameter or be inferred from the file extension. See the docs here.

In [25]: df = pd.DataFrame({'A': np.random.randn(1000),
   ....:                    'B': 'foo',
   ....:                    'C': pd.date_range('20130101', periods=1000, freq='s')})
   ....:

Using an explicit compression type

In [26]: df.to_pickle("data.pkl.compress", compression="gzip")

In [27]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")

In [28]: rt.head()
Out[28]: 
          A    B                   C
0 -1.344312  foo 2013-01-01 00:00:00
1  0.844885  foo 2013-01-01 00:00:01
2  1.075770  foo 2013-01-01 00:00:02
3 -0.109050  foo 2013-01-01 00:00:03
4  1.643563  foo 2013-01-01 00:00:04

[5 rows x 3 columns]

The default is to infer the compression type from the extension (compression='infer'):

In [29]: df.to_pickle("data.pkl.gz")

In [30]: rt = pd.read_pickle("data.pkl.gz")

In [31]: rt.head()
Out[31]: 
          A    B                   C
0 -1.344312  foo 2013-01-01 00:00:00
1  0.844885  foo 2013-01-01 00:00:01
2  1.075770  foo 2013-01-01 00:00:02
3 -0.109050  foo 2013-01-01 00:00:03
4  1.643563  foo 2013-01-01 00:00:04

[5 rows x 3 columns]

In [32]: df["A"].to_pickle("s1.pkl.bz2")

In [33]: rt = pd.read_pickle("s1.pkl.bz2")

In [34]: rt.head()
Out[34]: 
0   -1.344312
1    0.844885
2    1.075770
3   -0.109050
4    1.643563
Name: A, Length: 5, dtype: float64

UInt64 support improved

Pandas has significantly improved support for operations involving unsigned, or purely non-negative, integers. Previously, handling these integers would result in improper rounding or data-type casting, leading to incorrect results. Notably, a new numerical index, UInt64Index, has been created (GH14937)

In [35]: idx = pd.UInt64Index([1, 2, 3])

In [36]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, index=idx)

In [37]: df.index
Out[37]: UInt64Index([1, 2, 3], dtype='uint64')

GroupBy on categoricals

In previous versions, .groupby(..., sort=False) would fail with a ValueError when grouping on a categorical series with some categories not appearing in the data. (GH13179)

In [38]: chromosomes = np.r_[np.arange(1, 23).astype(str), ['X', 'Y']]

In [39]: df = pd.DataFrame({
   ....:     'A': np.random.randint(100),
   ....:     'B': np.random.randint(100),
   ....:     'C': np.random.randint(100),
   ....:     'chromosomes': pd.Categorical(np.random.choice(chromosomes, 100),
   ....:                                   categories=chromosomes,
   ....:                                   ordered=True)})
   ....: 

In [40]: df
Out[40]: 
     A   B   C chromosomes
0   87  22  81           4
1   87  22  81          13
2   87  22  81          22
3   87  22  81           2
4   87  22  81           6
..  ..  ..  ..         ...
95  87  22  81           8
96  87  22  81          11
97  87  22  81           X
98  87  22  81           1
99  87  22  81          19

[100 rows x 4 columns]

Previous behavior:

In [3]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
---------------------------------------------------------------------------
ValueError: items in new_categories are not the same as in old categories

New behavior:

In [41]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
Out[41]: 
               A    B    C
chromosomes               
2            348   88  324
3            348   88  324
4            348   88  324
5            261   66  243
6            174   44  162
...          ...  ...  ...
22           348   88  324
X            348   88  324
Y            435  110  405
1              0    0    0
21             0    0    0

[24 rows x 3 columns]

Table schema output

The new orient 'table' for DataFrame.to_json() will generate a Table Schema compatible string representation of the data.

In [42]: df = pd.DataFrame(
   ....:     {'A': [1, 2, 3],
   ....:      'B': ['a', 'b', 'c'],
   ....:      'C': pd.date_range('2016-01-01', freq='d', periods=3)},
   ....:     index=pd.Index(range(3), name='idx'))
   ....: 

In [43]: df
Out[43]: 
     A  B          C
idx                 
0    1  a 2016-01-01
1    2  b 2016-01-02
2    3  c 2016-01-03

[3 rows x 3 columns]

In [44]: df.to_json(orient='table')
Out[44]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'

See IO: Table Schema for more information.

Additionally, the repr for DataFrame and Series can now publish this JSON Table schema representation of the Series or DataFrame if you are using IPython (or another frontend like nteract using the Jupyter messaging protocol). This gives frontends like the Jupyter notebook and nteract more flexibility in how they display pandas objects, since they have more information about the data. You must enable this by setting the display.html.table_schema option to True.
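
Enabling the option is a one-liner; here is a minimal sketch (the DataFrame is illustrative):

import pandas as pd

# opt in to the Table Schema repr; it is off by default
pd.set_option('display.html.table_schema', True)

df = pd.DataFrame({'A': [1, 2, 3]})
df  # in Jupyter/nteract, the repr now also publishes the Table Schema JSON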

SciPy sparse matrix from/to SparseDataFrame

Pandas now supports creating sparse dataframes directly from scipy.sparse.spmatrix instances. See the documentation for more information. (GH4343)

All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed.

In [45]: from scipy.sparse import csr_matrix

In [46]: arr = np.random.random(size=(1000, 5))

In [47]: arr[arr < .9] = 0

In [48]: sp_arr = csr_matrix(arr)

In [49]: sp_arr
Out[49]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
	with 501 stored elements in Compressed Sparse Row format>

In [50]: sdf = pd.SparseDataFrame(sp_arr)

In [51]: sdf
Out[51]: 
      0   1         2   3         4
0   NaN NaN  0.977426 NaN       NaN
1   NaN NaN       NaN NaN  0.969340
2   NaN NaN       NaN NaN       NaN
3   NaN NaN       NaN NaN       NaN
4   NaN NaN       NaN NaN       NaN
..   ..  ..       ...  ..       ...
995 NaN NaN       NaN NaN  0.917524
996 NaN NaN       NaN NaN       NaN
997 NaN NaN       NaN NaN  0.968178
998 NaN NaN       NaN NaN  0.901563
999 NaN NaN       NaN NaN       NaN

[1000 rows x 5 columns]

To convert a SparseDataFrame back to a sparse SciPy matrix in COO format, you can use:

In [52]: sdf.to_coo()
Out[52]: 
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
	with 501 stored elements in COOrdinate format>

Excel output for styled DataFrames

Experimental support has been added to export DataFrame.style formats to Excel using the openpyxl engine. (GH15530)

For example, after running the following, styled.xlsx renders as below:

In [53]: np.random.seed(24)

In [54]: df = pd.DataFrame({'A': np.linspace(1, 10, 10)})

In [55]: df = pd.concat([df, pd.DataFrame(np.random.RandomState(24).randn(10, 4),
   ....:                                  columns=list('BCDE'))],
   ....:                axis=1)
   ....: 

In [56]: df.iloc[0, 2] = np.nan

In [57]: df
Out[57]: 
      A         B         C         D         E
0   1.0  1.329212       NaN -0.316280 -0.990810
1   2.0 -1.070816 -1.438713  0.564417  0.295722
2   3.0 -1.626404  0.219565  0.678805  1.889273
3   4.0  0.961538  0.104011 -0.481165  0.850229
4   5.0  1.453425  1.057737  0.165562  0.515018
5   6.0 -1.336936  0.562861  1.392855 -0.063328
6   7.0  0.121668  1.207603 -0.002040  1.627796
7   8.0  0.354493  1.037528 -0.385684  0.519818
8   9.0  1.686583 -1.325963  1.428984 -2.089354
9  10.0 -0.129820  0.631523 -0.586538  0.290720

[10 rows x 5 columns]

In [58]: styled = (df.style
   ....:           .applymap(lambda val: 'color: %s' % 'red' if val < 0 else 'black')
   ....:           .highlight_max())
   ....: 

In [59]: styled.to_excel('styled.xlsx', engine='openpyxl')

[screenshot: styled.xlsx as rendered in Excel]

See the Style documentation for more detail.

IntervalIndex

pandas has gained an IntervalIndex with its own dtype, interval, as well as the Interval scalar type. These allow first-class support for interval notation, specifically as a return type for the categories in cut() and qcut(). The IntervalIndex allows some unique indexing, see the docs. (GH7640, GH8625)

Warning

These indexing behaviors of the IntervalIndex are provisional and may change in a future version of pandas. Feedback on usage is welcome.

Previous behavior:

The returned categories were strings, representing Intervals

In [1]: c = pd.cut(range(4), bins=2)

In [2]: c
Out[2]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3], (1.5, 3]]
Categories (2, object): [(-0.003, 1.5] < (1.5, 3]]

In [3]: c.categories
Out[3]: Index(['(-0.003, 1.5]', '(1.5, 3]'], dtype='object')

New behavior:

In [60]: c = pd.cut(range(4), bins=2)

In [61]: c
Out[61]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [62]: c.categories
Out[62]: 
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')

Furthermore, this allows one to bin other data with these same bins, with NaN representing a missing value similar to other dtypes.

In [63]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[63]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

An IntervalIndex can also be used in Series and DataFrame as the index.

In [64]: df = pd.DataFrame({'A': range(4),
   ....:                    'B': pd.cut([0, 3, 1, 1], bins=c.categories)
   ....:                    }).set_index('B')
   ....: 

In [65]: df
Out[65]: 
               A
B               
(-0.003, 1.5]  0
(1.5, 3.0]     1
(-0.003, 1.5]  2
(-0.003, 1.5]  3

[4 rows x 1 columns]

Selecting via a specific interval:

In [66]: df.loc[pd.Interval(1.5, 3.0)]
Out[66]: 
A    1
Name: (1.5, 3.0], Length: 1, dtype: int64

Selecting via a scalar value that is contained in the intervals.

In [67]: df.loc[0]
Out[67]: 
               A
B               
(-0.003, 1.5]  0
(-0.003, 1.5]  2
(-0.003, 1.5]  3

[3 rows x 1 columns]

Other enhancements

Backwards incompatible API changes

Possible incompatibility for HDF5 formats created with pandas < 0.13.0

pd.TimeSeries was officially deprecated in 0.17.0, though it had been only an alias since 0.13.0. It has been dropped in favor of pd.Series. (GH15098).

This may cause HDF5 files that were created in prior versions to become unreadable if pd.TimeSeries was used. This is most likely to affect files created with pandas < 0.13.0. If you find yourself in this situation, you can use a recent prior version of pandas to read in your HDF5 files, then write them out again after applying the procedure below.

In [2]: s = pd.TimeSeries([1, 2, 3], index=pd.date_range('20130101', periods=3))

In [3]: s
Out[3]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [4]: type(s)
Out[4]: pandas.core.series.TimeSeries

In [5]: s = pd.Series(s)

In [6]: s
Out[6]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [7]: type(s)
Out[7]: pandas.core.series.Series

Map on Index types now return other Index types

map on an Index now returns an Index, not a numpy array (GH12766)

In [68]: idx = pd.Index([1, 2])

In [69]: idx
Out[69]: Int64Index([1, 2], dtype='int64')

In [70]: mi = pd.MultiIndex.from_tuples([(1, 2), (2, 4)])

In [71]: mi
Out[71]: 
MultiIndex([(1, 2),
            (2, 4)],
           )

Previous behavior:

In [5]: idx.map(lambda x: x * 2)
Out[5]: array([2, 4])

In [6]: idx.map(lambda x: (x, x * 2))
Out[6]: array([(1, 2), (2, 4)], dtype=object)

In [7]: mi.map(lambda x: x)
Out[7]: array([(1, 2), (2, 4)], dtype=object)

In [8]: mi.map(lambda x: x[0])
Out[8]: array([1, 2])

New behavior:

In [72]: idx.map(lambda x: x * 2)
Out[72]: Int64Index([2, 4], dtype='int64')

In [73]: idx.map(lambda x: (x, x * 2))
Out[73]: 
MultiIndex([(1, 2),
            (2, 4)],
           )

In [74]: mi.map(lambda x: x)
Out[74]: 
MultiIndex([(1, 2),
            (2, 4)],
           )

In [75]: mi.map(lambda x: x[0])
Out[75]: Int64Index([1, 2], dtype='int64')

map on a Series with datetime64 values may return int64 dtypes rather than int32

In [76]: s = pd.Series(pd.date_range('2011-01-02T00:00', '2011-01-02T02:00', freq='H')
   ....:               .tz_localize('Asia/Tokyo'))
   ....: 

In [77]: s
Out[77]: 
0   2011-01-02 00:00:00+09:00
1   2011-01-02 01:00:00+09:00
2   2011-01-02 02:00:00+09:00
Length: 3, dtype: datetime64[ns, Asia/Tokyo]

Previous behavior:

In [9]: s.map(lambda x: x.hour)
Out[9]:
0    0
1    1
2    2
dtype: int32

New behavior:

In [78]: s.map(lambda x: x.hour)
Out[78]: 
0    0
1    1
2    2
Length: 3, dtype: int64

Accessing datetime fields of Index now return Index

The datetime-related attributes (see here for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously returned numpy arrays. They will now return a new Index object, except in the case of a boolean field, where the result will still be a boolean ndarray. (GH15022)

Previous behavior:

In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx.hour
Out[2]: array([ 0, 10, 20,  6, 16], dtype=int32)

New behavior:

In [79]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [80]: idx.hour
Out[80]: Int64Index([0, 10, 20, 6, 16], dtype='int64')

This has the advantage that specific Index methods are still available on the result. On the other hand, this might have backward incompatibilities: e.g. compared to numpy arrays, Index objects are not mutable. To get the original ndarray, you can always convert explicitly using np.asarray(idx.hour).
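
A minimal sketch of that explicit conversion, reusing the idx from this example:

hours = idx.hour          # Int64Index under the new behavior
arr = np.asarray(hours)   # back to a plain numpy ndarray, as in previous versions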

pd.unique will now be consistent with extension types

In prior versions, using Series.unique() and pandas.unique() on Categorical and tz-aware data-types would yield different return types. These are now made consistent. (GH15903)

  • Datetime tz-aware

Previous behavior:

# Series
In [5]: pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:            pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[5]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [6]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:                      pd.Timestamp('20160101', tz='US/Eastern')]))
Out[6]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

# Index
In [7]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[7]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [8]: pd.unique([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:            pd.Timestamp('20160101', tz='US/Eastern')])
Out[8]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

New behavior:

# Series, returns an array of Timestamp tz-aware
In [81]: pd.Series([pd.Timestamp(r'20160101', tz=r'US/Eastern'),
   ....:            pd.Timestamp(r'20160101', tz=r'US/Eastern')]).unique()
   ....: 
Out[81]: 
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

In [82]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:           pd.Timestamp('20160101', tz='US/Eastern')]))
   ....: 
Out[82]: 
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

# Index, returns a DatetimeIndex
In [83]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
   ....: 
Out[83]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [84]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:                     pd.Timestamp('20160101', tz='US/Eastern')]))
   ....: 
Out[84]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
  • Categoricals

Previous behavior:

In [1]: pd.Series(list('baabc'), dtype='category').unique()
Out[1]:
[b, a, c]
Categories (3, object): [b, a, c]

In [2]: pd.unique(pd.Series(list('baabc'), dtype='category'))
Out[2]: array(['b', 'a', 'c'], dtype=object)

New behavior:

# returns a Categorical
In [85]: pd.Series(list('baabc'), dtype='category').unique()
Out[85]: 
[b, a, c]
Categories (3, object): [b, a, c]

In [86]: pd.unique(pd.Series(list('baabc'), dtype='category'))
Out[86]: 
[b, a, c]
Categories (3, object): [b, a, c]

S3 file handling

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. (GH11915).
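
As a sketch (the bucket and key below are hypothetical), reading directly from S3 looks the same as before; pandas simply dispatches to s3fs under the hood:

import pandas as pd

# hypothetical bucket/key; raises ImportError if s3fs is not installed
df = pd.read_csv('s3://my-bucket/path/to/data.csv')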

Partial string indexing changes

DatetimeIndex Partial String Indexing now works as an exact match, provided that string resolution coincides with index resolution, including a case when both are seconds (GH14826). See Slice vs. Exact Match for details.

In [87]: df = pd.DataFrame({'a': [1, 2, 3]}, pd.DatetimeIndex(['2011-12-31 23:59:59',
   ....:                                                       '2012-01-01 00:00:00',
   ....:                                                       '2012-01-01 00:00:01']))
   ....:

Previous behavior:

In [4]: df['2011-12-31 23:59:59']
Out[4]:
                       a
2011-12-31 23:59:59  1

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]:
2011-12-31 23:59:59    1
Name: a, dtype: int64

New behavior:

In [4]: df['2011-12-31 23:59:59']
KeyError: '2011-12-31 23:59:59'

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]: 1

Concat of different float dtypes will not automatically upcast

Previously, concat of multiple objects with different float dtypes would automatically upcast results to a dtype of float64. Now the smallest acceptable dtype will be used (GH13247)

In [88]: df1 = pd.DataFrame(np.array([1.0], dtype=np.float32, ndmin=2))

In [89]: df1.dtypes
Out[89]: 
0    float32
Length: 1, dtype: object

In [90]: df2 = pd.DataFrame(np.array([np.nan], dtype=np.float32, ndmin=2))

In [91]: df2.dtypes
Out[91]: 
0    float32
Length: 1, dtype: object

Previous behavior:

In [7]: pd.concat([df1, df2]).dtypes
Out[7]:
0    float64
dtype: object

New behavior:

In [92]: pd.concat([df1, df2]).dtypes
Out[92]: 
0    float32
Length: 1, dtype: object

Pandas Google BigQuery support has moved

pandas has split off Google BigQuery support into a separate package pandas-gbq. You can conda install pandas-gbq -c conda-forge or pip install pandas-gbq to get it. The functionality of read_gbq() and DataFrame.to_gbq() remains the same with the currently released version of pandas-gbq=0.1.4. Documentation is now hosted here (GH15347)
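
A minimal sketch of the unchanged call (the project and table names are hypothetical):

import pandas as pd

# requires the separate pandas-gbq package to be installed
df = pd.read_gbq('SELECT name FROM my_dataset.my_table',
                 project_id='my-project')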

Memory usage for Index is more accurate

In previous versions, showing .memory_usage() on a pandas structure that has an index would only include the actual index values, not the structures that facilitate fast indexing. This will generally be different for Index and MultiIndex and less so for other index types. (GH15237)

Previous behavior:

In [8]: index = pd.Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 180

New behavior:

In [8]: index = pd.Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 260

DataFrame.sort_index changes

In certain cases, calling .sort_index() on a MultiIndexed DataFrame would return the same DataFrame without seeming to sort. This would happen with lexsorted, but non-monotonic, levels. (GH15622, GH15687, GH14015, GH13431, GH15797)

This is unchanged from prior versions, but shown for illustration purposes:

In [93]: df = pd.DataFrame(np.arange(6), columns=['value'],
   ....:                   index=pd.MultiIndex.from_product([list('BA'), range(3)]))
   ....: 

In [94]: df
Out[94]: 
     value
B 0      0
  1      1
  2      2
A 0      3
  1      4
  2      5

[6 rows x 1 columns]

In [95]: df.index.is_lexsorted()
Out[95]: False

In [96]: df.index.is_monotonic
Out[96]: False

Sorting works as expected

In [97]: df.sort_index()
Out[97]: 
     value
A 0      3
  1      4
  2      5
B 0      0
  1      1
  2      2

[6 rows x 1 columns]

In [98]: df.sort_index().index.is_lexsorted()
Out[98]: True

In [99]: df.sort_index().index.is_monotonic
Out[99]: True

However, this example, which has a non-monotonic 2nd level, doesn’t behave as desired.

In [100]: df = pd.DataFrame({'value': [1, 2, 3, 4]},
   .....:                   index=pd.MultiIndex([['a', 'b'], ['bb', 'aa']],
   .....:                                       [[0, 0, 1, 1], [0, 1, 0, 1]]))
   .....: 

In [101]: df
Out[101]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

[4 rows x 1 columns]

Previous behavior:

In [11]: df.sort_index()
Out[11]:
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [14]: df.sort_index().index.is_lexsorted()
Out[14]: True

In [15]: df.sort_index().index.is_monotonic
Out[15]: False

New behavior:

In [102]: df.sort_index()
Out[102]: 
      value
a aa      2
  bb      1
b aa      4
  bb      3

[4 rows x 1 columns]

In [103]: df.sort_index().index.is_lexsorted()
Out[103]: True

In [104]: df.sort_index().index.is_monotonic
Out[104]: True

Groupby describe formatting

The output formatting of groupby.describe() now labels the describe() metrics in the columns instead of the index. This format is consistent with groupby.agg() when applying multiple functions at once. (GH4792)

Previous behavior:

In [1]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [2]: df.groupby('A').describe()
Out[2]:
                B
A
1 count  2.000000
  mean   1.500000
  std    0.707107
  min    1.000000
  25%    1.250000
  50%    1.500000
  75%    1.750000
  max    2.000000
2 count  2.000000
  mean   3.500000
  std    0.707107
  min    3.000000
  25%    3.250000
  50%    3.500000
  75%    3.750000
  max    4.000000

In [3]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[3]:
     B
  mean       std amin amax
A
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4

New behavior:

In [105]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [106]: df.groupby('A').describe()
Out[106]: 
      B                                          
  count mean       std  min   25%  50%   75%  max
A                                                
1   2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0
2   2.0  3.5  0.707107  3.0  3.25  3.5  3.75  4.0

[2 rows x 8 columns]

In [107]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[107]: 
     B                    
  mean       std amin amax
A                         
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4

[2 rows x 4 columns]

Window binary corr/cov operations return a MultiIndex DataFrame

A binary window operation, like .corr() or .cov(), when operating on a .rolling(..), .expanding(..), or .ewm(..) object, will now return a 2-level MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here. These are equivalent in function, but a MultiIndexed DataFrame enjoys more support in pandas. See the section on Windowed Binary Operations for more information. (GH15677)

In [108]: np.random.seed(1234)

In [109]: df = pd.DataFrame(np.random.rand(100, 2),
   .....:                   columns=pd.Index(['A', 'B'], name='bar'),
   .....:                   index=pd.date_range('20160101',
   .....:                                       periods=100, freq='D', name='foo'))
   .....: 

In [110]: df.tail()
Out[110]: 
bar                A         B
foo                           
2016-04-05  0.640880  0.126205
2016-04-06  0.171465  0.737086
2016-04-07  0.127029  0.369650
2016-04-08  0.604334  0.103104
2016-04-09  0.802374  0.945553

[5 rows x 2 columns]

Previous behavior:

In [2]: df.rolling(12).corr()
Out[2]:
<class 'pandas.core.panel.Panel'>
Dimensions: 100 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2016-01-01 00:00:00 to 2016-04-09 00:00:00
Major_axis axis: A to B
Minor_axis axis: A to B

New behavior:

In [111]: res = df.rolling(12).corr()

In [112]: res.tail()
Out[112]: 
bar                    A         B
foo        bar                    
2016-04-07 B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[5 rows x 2 columns]

Retrieving a correlation matrix for a cross-section

In [113]: df.rolling(12).corr().loc['2016-04-07']
Out[113]: 
bar                   A        B
foo        bar                  
2016-04-07 A    1.00000 -0.13209
           B   -0.13209  1.00000

[2 rows x 2 columns]

HDFStore where string comparison

In previous versions, most types could be compared to a string column in an HDFStore, usually resulting in an invalid comparison that returned an empty result frame. These comparisons will now raise a TypeError (GH15492)

In [114]: df = pd.DataFrame({'unparsed_date': ['2014-01-01', '2014-01-01']})

In [115]: df.to_hdf('store.h5', 'key', format='table', data_columns=True)

In [116]: df.dtypes
Out[116]: 
unparsed_date    object
Length: 1, dtype: object

Previous behavior:

In [4]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
File "<string>", line 1
  (unparsed_date > 1970-01-01 00:00:01.388552400)
                        ^
SyntaxError: invalid token

New behavior:

In [18]: ts = pd.Timestamp('2014-01-01')

In [19]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
TypeError: Cannot compare 2014-01-01 00:00:00 of
type <class 'pandas.tslib.Timestamp'> to string column

Index.intersection and inner join now preserve the order of the left Index

Index.intersection() now preserves the order of the calling Index (left) instead of the other Index (right) (GH15582). This affects inner joins, DataFrame.join() and merge(), and the .align method (see the sketch at the end of this section).

  • Index.intersection
In [117]: left = pd.Index([2, 1, 0])

In [118]: left
Out[118]: Int64Index([2, 1, 0], dtype='int64')

In [119]: right = pd.Index([1, 2, 3])

In [120]: right
Out[120]: Int64Index([1, 2, 3], dtype='int64')

Previous behavior:

In [4]: left.intersection(right)
Out[4]: Int64Index([1, 2], dtype='int64')

New behavior:

In [121]: left.intersection(right)
Out[121]: Int64Index([2, 1], dtype='int64')
  • DataFrame.join and pd.merge
In [122]: left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])

In [123]: left
Out[123]: 
    a
2  20
1  10
0   0

[3 rows x 1 columns]

In [124]: right = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])

In [125]: right
Out[125]: 
     b
1  100
2  200
3  300

[3 rows x 1 columns]

Previous behavior:

In [4]: left.join(right, how='inner')
Out[4]:
   a    b
1  10  100
2  20  200

New behavior:

In [126]: left.join(right, how='inner')
Out[126]: 
    a    b
2  20  200
1  10  100

[2 rows x 2 columns]
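
The same left-order preservation applies to .align, as referenced above. A minimal sketch reusing the left and right frames from this section:

# inner align on the index now keeps the calling (left) frame's order: [2, 1]
left_aligned, right_aligned = left.align(right, join='inner', axis=0)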

Pivot table always returns a DataFrame

The documentation for pivot_table() states that a DataFrame is always returned. Here a bug is fixed that allowed this to return a Series under certain circumstances. (GH4386)

In [127]: df = pd.DataFrame({'col1': [3, 4, 5],
   .....:                    'col2': ['C', 'D', 'E'],
   .....:                    'col3': [1, 3, 9]})
   .....: 

In [128]: df
Out[128]: 
   col1 col2  col3
0     3    C     1
1     4    D     3
2     5    E     9

[3 rows x 3 columns]

Previous behavior:

In [2]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[2]:
col3  col2
1     C       3
3     D       4
9     E       5
Name: col1, dtype: int64

New behavior:

In [129]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[129]: 
           col1
col3 col2      
1    C        3
3    D        4
9    E        5

[3 rows x 1 columns]

Other API changes

  • numexpr version is now required to be >= 2.4.6 and it will not be used at all if this requirement is not fulfilled (GH15213).
  • CParserError has been renamed to ParserError in pd.read_csv() and will be removed in the future (GH12665)
  • SparseArray.cumsum() and SparseSeries.cumsum() will now always return SparseArray and SparseSeries respectively (GH12855)
  • DataFrame.applymap() with an empty DataFrame will return a copy of the empty DataFrame instead of a Series (GH8222)
  • Series.map() now respects default values of dictionary subclasses with a __missing__ method, such as collections.Counter (GH15999); see the sketch after this list
  • .loc has compat with .ix for accepting iterators, and NamedTuples (GH15120)
  • interpolate() and fillna() will raise a ValueError if the limit keyword argument is not greater than 0. (GH9217)
  • pd.read_csv() will now issue a ParserWarning whenever there are conflicting values provided by the dialect parameter and the user (GH14898)
  • pd.read_csv() will now raise a ValueError for the C engine if the quote character is larger than one byte (GH11592)
  • inplace arguments now require a boolean value, else a ValueError is thrown (GH14189)
  • pandas.api.types.is_datetime64_ns_dtype will now report True on a tz-aware dtype, similar to pandas.api.types.is_datetime64_any_dtype
  • DataFrame.asof() will return a null filled Series instead of the scalar NaN if a match is not found (GH15118)
  • Specific support for copy.copy() and copy.deepcopy() functions on NDFrame objects (GH15444)
  • Series.sort_values() accepts a one element list of bool for consistency with the behavior of DataFrame.sort_values() (GH15604)
  • .merge() and .join() on category dtype columns will now preserve the category dtype when possible (GH10409)
  • SparseDataFrame.default_fill_value will be 0, previously was nan in the return from pd.get_dummies(..., sparse=True) (GH15594)
  • The default behavior of Series.str.match has changed from extracting groups to matching the pattern. The extracting behavior was deprecated since pandas version 0.13.0 and can be done with the Series.str.extract method (GH5224). As a consequence, the as_indexer keyword is ignored (no longer needed to specify the new behavior) and is deprecated.
  • NaT will now correctly report False for datetimelike boolean operations such as is_month_start (GH15781)
  • NaT will now correctly return np.nan for Timedelta and Period accessors such as days and quarter (GH15782)
  • NaT will now return NaT for the tz_localize and tz_convert methods (GH15830)
  • DataFrame and Panel constructors with invalid input will now raise ValueError rather than pandas.core.common.PandasError, if called with scalar inputs and not axes; the exception PandasError has been removed as well (GH15541)
  • The exception pandas.core.common.AmbiguousIndexError is removed as it is not referenced (GH15541)
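
For the Series.map() change referenced above, a minimal sketch with collections.Counter, whose __missing__ returns 0:

import collections

import pandas as pd

counter = collections.Counter()
counter['bar'] += 1

s = pd.Series(['foo', 'bar', 'baz'])
s.map(counter)   # 'foo' and 'baz' are missing and now map to 0; 'bar' maps to 1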

Reorganization of the library: privacy changes

Module privacy has changed

Some formerly public Python/C/C++/Cython extension modules have been moved and/or renamed. These are all removed from the public API. Furthermore, the pandas.core, pandas.compat, and pandas.util top-level modules are now considered to be PRIVATE. If indicated, a deprecation warning will be issued if you reference these modules. (GH12588)

Previous Location       New Location                         Deprecated
pandas.lib              pandas._libs.lib                     X
pandas.tslib            pandas._libs.tslib                   X
pandas.computation      pandas.core.computation              X
pandas.msgpack          pandas.io.msgpack
pandas.index            pandas._libs.index
pandas.algos            pandas._libs.algos
pandas.hashtable        pandas._libs.hashtable
pandas.indexes          pandas.core.indexes
pandas.json             pandas._libs.json / pandas.io.json   X
pandas.parser           pandas._libs.parsers                 X
pandas.formats          pandas.io.formats
pandas.sparse           pandas.core.sparse
pandas.tools            pandas.core.reshape                  X
pandas.types            pandas.core.dtypes                   X
pandas.io.sas.saslib    pandas.io.sas._sas
pandas._join            pandas._libs.join
pandas._hash            pandas._libs.hashing
pandas._period          pandas._libs.period
pandas._sparse          pandas._libs.sparse
pandas._testing         pandas._libs.testing
pandas._window          pandas._libs.window

Some new subpackages are created with public functionality that is not directly exposed in the top-level namespace: pandas.errors, pandas.plotting and pandas.testing (more details below). Together with pandas.api.types and certain functions in the pandas.io and pandas.tseries submodules, these are now the public subpackages.

Further changes:

pandas.errors

We are adding a standard public module for all pandas exceptions & warnings, pandas.errors (GH14800). Previously these exceptions & warnings could be imported from pandas.core.common or pandas.io.common. These exceptions and warnings will be removed from the *.common locations in a future release. (GH15541)

The following are now part of this API:

['DtypeWarning',
 'EmptyDataError',
 'OutOfBoundsDatetime',
 'ParserError',
 'ParserWarning',
 'PerformanceWarning',
 'UnsortedIndexError',
 'UnsupportedFunctionCall']
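
A minimal sketch of the new import location (the file name is hypothetical):

import pandas as pd
from pandas.errors import EmptyDataError

try:
    pd.read_csv('empty.csv')   # hypothetical zero-byte file
except EmptyDataError as err:
    print(err)                 # e.g. "No columns to parse from file"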

pandas.testing

We are adding a standard module that exposes the public testing functions in pandas.testing (GH9895). Those functions can be used when writing tests for functionality using pandas objects.

The following testing functions are now part of this API:

['assert_frame_equal',
 'assert_series_equal',
 'assert_index_equal']
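
A minimal usage sketch:

import pandas as pd
from pandas.testing import assert_series_equal

# passes silently when equal, raises AssertionError otherwise
assert_series_equal(pd.Series([1, 2, 3]), pd.Series([1, 2, 3]))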

pandas.plotting

A new public pandas.plotting module has been added that holds plotting functionality that was previously in either pandas.tools.plotting or in the top-level namespace. See the deprecations sections for more details.
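
For example, a function that previously lived in pandas.tools.plotting is now importable from the new module (a minimal sketch; requires matplotlib):

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
scatter_matrix(df, alpha=0.5)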

Other Development Changes

Deprecations

Deprecate .ix

The .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers. .ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index positionally OR via labels, depending on the data type of the index. This has caused quite a bit of user confusion over the years. The full indexing documentation is here. (GH14218)

The recommended methods of indexing are:

  • .loc if you want to label index
  • .iloc if you want to positionally index.

Using .ix will now show a DeprecationWarning with a link to some examples of how to convert code here.

In [130]: df = pd.DataFrame({'A': [1, 2, 3],
   .....:                    'B': [4, 5, 6]},
   .....:                   index=list('abc'))
   .....: 

In [131]: df
Out[131]: 
   A  B
a  1  4
b  2  5
c  3  6

[3 rows x 2 columns]

Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the ‘A’ column.

In [3]: df.ix[[0, 2], 'A']
Out[3]:
a    1
c    3
Name: A, dtype: int64

Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.

In [132]: df.loc[df.index[[0, 2]], 'A']
Out[132]: 
a    1
c    3
Name: A, Length: 2, dtype: int64

Using .iloc. Here we will get the location of the ‘A’ column, then use positional indexing to select things.

In [133]: df.iloc[[0, 2], df.columns.get_loc('A')]
Out[133]: 
a    1
c    3
Name: A, Length: 2, dtype: int64

Deprecate Panel

Panel is deprecated and will be removed in a future version. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this conversion (GH13563).

In [133]: import pandas.util.testing as tm

In [134]: p = tm.makePanel()

In [135]: p
Out[135]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

Convert to a MultiIndex DataFrame

In [136]: p.to_frame()
Out[136]:
                     ItemA     ItemB     ItemC
major      minor
2000-01-03 A      0.628776 -1.409432  0.209395
           B      0.988138 -1.347533 -0.896581
           C     -0.938153  1.272395 -0.161137
           D     -0.223019 -0.591863 -1.051539
2000-01-04 A      0.186494  1.422986 -0.592886
           B     -0.072608  0.363565  1.104352
           C     -1.239072 -1.449567  0.889157
           D      2.123692 -0.414505 -0.319561
2000-01-05 A      0.952478 -2.147855 -1.473116
           B     -0.550603 -0.014752 -0.431550
           C      0.139683 -1.195524  0.288377
           D      0.122273 -1.425795 -0.619993

[12 rows x 3 columns]

Convert to an xarray DataArray

In [137]: p.to_xarray()
Out[137]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.628776,  0.988138, -0.938153, -0.223019],
        [ 0.186494, -0.072608, -1.239072,  2.123692],
        [ 0.952478, -0.550603,  0.139683,  0.122273]],

       [[-1.409432, -1.347533,  1.272395, -0.591863],
        [ 1.422986,  0.363565, -1.449567, -0.414505],
        [-2.147855, -0.014752, -1.195524, -1.425795]],

       [[ 0.209395, -0.896581, -0.161137, -1.051539],
        [-0.592886,  1.104352,  0.889157, -0.319561],
        [-1.473116, -0.43155 ,  0.288377, -0.619993]]])
Coordinates:
  * items       (items) object 'ItemA' 'ItemB' 'ItemC'
  * major_axis  (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
  * minor_axis  (minor_axis) object 'A' 'B' 'C' 'D'

Deprecate groupby.agg() with a dictionary when renaming

The .groupby(..).agg(..), .rolling(..).agg(..), and .resample(..).agg(..) syntax can accept a variety of inputs, including scalars, lists, and dicts of column names to scalars or lists. This provides a useful syntax for constructing multiple (potentially different) aggregations.

However, .agg(..) can also accept a dict that allows ‘renaming’ of the result columns. This is a complicated and confusing syntax, as well as not consistent between Series and DataFrame. We are deprecating this ‘renaming’ functionality.

  • We are deprecating passing a dict to a grouped/rolled/resampled Series. This allowed one to rename the resulting aggregation, but this had a completely different meaning than passing a dictionary to a grouped DataFrame, which accepts column-to-aggregations.
  • We are deprecating passing a dict-of-dicts to a grouped/rolled/resampled DataFrame in a similar manner.

This is an illustrative example:

In [134]: df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
   .....:                    'B': range(5),
   .....:                    'C': range(5)})
   .....: 

In [135]: df
Out[135]: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

[5 rows x 3 columns]

Here is a typical useful syntax for computing different aggregations for different columns. This is a natural and useful syntax. We aggregate from the dict-to-list by taking the specified columns and applying the list of functions. This returns a MultiIndex for the columns (this is not deprecated).

In [136]: df.groupby('A').agg({'B': 'sum', 'C': 'min'})
Out[136]: 
   B  C
A      
1  3  0
2  7  3

[2 rows x 2 columns]

Here’s an example of the first deprecation, passing a dict to a grouped Series. This is a combination aggregation & renaming:

In [6]: df.groupby('A').B.agg({'foo': 'count'})
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version

Out[6]:
   foo
A
1    3
2    2

You can accomplish the same operation more idiomatically by:

In [137]: df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
Out[137]: 
   foo
A     
1    3
2    2

[2 rows x 1 columns]

Here’s an example of the second deprecation, passing a dict-of-dict to a grouped DataFrame:

In [23]: (df.groupby('A')
    ...:    .agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
    ...:  )
FutureWarning: using a dict with renaming is deprecated and
will be removed in a future version

Out[23]:
     B   C
   foo bar
A
1   3   0
2   7   3

You can accomplish nearly the same by:

In [138]: (df.groupby('A')
   .....:    .agg({'B': 'sum', 'C': 'min'})
   .....:    .rename(columns={'B': 'foo', 'C': 'bar'})
   .....:  )
   .....: 
Out[138]: 
   foo  bar
A          
1    3    0
2    7    3

[2 rows x 2 columns]

Deprecate .plotting

The pandas.tools.plotting module has been deprecated, in favor of the top level pandas.plotting module. All the public plotting functions are now available from pandas.plotting (GH12548).

Furthermore, the top-level pandas.scatter_matrix and pandas.plot_params are deprecated. Users can import these from pandas.plotting as well.

Previous script:

pd.tools.plotting.scatter_matrix(df)
pd.scatter_matrix(df)

Should be changed to:

pd.plotting.scatter_matrix(df)

Other deprecations

  • SparseArray.to_dense() has deprecated the fill parameter, as that parameter was not being respected (GH14647)
  • SparseSeries.to_dense() has deprecated the sparse_only parameter (GH14647)
  • Series.repeat() has deprecated the reps parameter in favor of repeats (GH12662)
  • The Series constructor and .astype method have deprecated accepting timestamp dtypes without a frequency (e.g. np.datetime64) for the dtype parameter (GH15524)
  • Index.repeat() and MultiIndex.repeat() have deprecated the n parameter in favor of repeats (GH12662)
  • Categorical.searchsorted() and Series.searchsorted() have deprecated the v parameter in favor of value (GH12662)
  • TimedeltaIndex.searchsorted(), DatetimeIndex.searchsorted(), and PeriodIndex.searchsorted() have deprecated the key parameter in favor of value (GH12662)
  • DataFrame.astype() has deprecated the raise_on_error parameter in favor of errors (GH14878)
  • Series.sortlevel and DataFrame.sortlevel have been deprecated in favor of Series.sort_index and DataFrame.sort_index (GH15099)
  • importing concat from pandas.tools.merge has been deprecated in favor of imports from the pandas namespace. This should only affect explicit imports (GH15358)
  • Series/DataFrame/Panel.consolidate() has been deprecated as a public method. (GH15483)
  • The as_indexer keyword of Series.str.match() has been deprecated (ignored keyword) (GH15257).
  • The following top-level pandas functions have been deprecated and will be removed in a future version (GH13790, GH15940); see the sketch after this list for replacements:
  • pd.pnow(), replaced by Period.now()
  • pd.Term, is removed, as it is not applicable to user code. Instead use in-line string expressions in the where clause when searching in HDFStore
  • pd.Expr, is removed, as it is not applicable to user code.
  • pd.match(), is removed.
  • pd.groupby(), replaced by using the .groupby() method directly on a Series/DataFrame
  • pd.get_store(), replaced by a direct call to pd.HDFStore(...)
  • is_any_int_dtype, is_floating_dtype, and is_sequence are deprecated from pandas.api.types (GH16042)
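
A sketch of the replacements for the deprecated top-level functions listed above (file and column names are illustrative):

import pandas as pd

pd.Period.now('D')                 # instead of pd.pnow('D')

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 3]})
df.groupby('A').sum()              # instead of pd.groupby(df, 'A').sum()

store = pd.HDFStore('store.h5')    # instead of pd.get_store('store.h5')
store.close()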

Removal of prior version deprecations/changes

Performance improvements

Bug fixes

Conversion

Indexing

I/O

Plotting

Groupby/resample/rolling

Sparse

Reshaping

Numeric

Other

Contributors

A total of 204 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • Adam J. Stewart +
  • Adrian +
  • Ajay Saxena
  • Akash Tandon +
  • Albert Villanova del Moral +
  • Aleksey Bilogur +
  • Alexis Mignon +
  • Amol Kahat +
  • Andreas Winkler +
  • Andrew Kittredge +
  • Anthonios Partheniou
  • Arco Bast +
  • Ashish Singal +
  • Baurzhan Muftakhidinov +
  • Ben Kandel
  • Ben Thayer +
  • Ben Welsh +
  • Bill Chambers +
  • Brandon M. Burroughs
  • Brian +
  • Brian McFee +
  • Carlos Souza +
  • Chris
  • Chris Ham
  • Chris Warth
  • Christoph Gohlke
  • Christoph Paulik +
  • Christopher C. Aycock
  • Clemens Brunner +
  • D.S. McNeil +
  • DaanVanHauwermeiren +
  • Daniel Himmelstein
  • Dave Willmer
  • David Cook +
  • David Gwynne +
  • David Hoffman +
  • David Krych
  • Diego Fernandez +
  • Dimitris Spathis +
  • Dmitry L +
  • Dody Suria Wijaya +
  • Dominik Stanczak +
  • Dr-Irv
  • Dr. Irv +
  • Elliott Sales de Andrade +
  • Ennemoser Christoph +
  • Francesc Alted +
  • Fumito Hamamura +
  • Giacomo Ferroni
  • Graham R. Jeffries +
  • Greg Williams +
  • Guilherme Beltramini +
  • Guilherme Samora +
  • Hao Wu +
  • Harshit Patni +
  • Ilya V. Schurov +
  • Iván Vallés Pérez
  • Jackie Leng +
  • Jaehoon Hwang +
  • James Draper +
  • James Goppert +
  • James McBride +
  • James Santucci +
  • Jan Schulz
  • Jeff Carey
  • Jeff Reback
  • JennaVergeynst +
  • Jim +
  • Jim Crist
  • Joe Jevnik
  • Joel Nothman +
  • John +
  • John Tucker +
  • John W. O’Brien
  • John Zwinck
  • Jon M. Mease
  • Jon Mease
  • Jonathan Whitmore +
  • Jonathan de Bruin +
  • Joost Kranendonk +
  • Joris Van den Bossche
  • Joshua Bradt +
  • Julian Santander
  • Julien Marrec +
  • Jun Kim +
  • Justin Solinsky +
  • Kacawi +
  • Kamal Kamalaldin +
  • Kerby Shedden
  • Kernc
  • Keshav Ramaswamy
  • Kevin Sheppard
  • Kyle Kelley
  • Larry Ren
  • Leon Yin +
  • Line Pedersen +
  • Lorenzo Cestaro +
  • Luca Scarabello
  • Lukasz +
  • Mahmoud Lababidi
  • Mark Mandel +
  • Matt Roeschke
  • Matthew Brett
  • Matthew Roeschke +
  • Matti Picus
  • Maximilian Roos
  • Michael Charlton +
  • Michael Felt
  • Michael Lamparski +
  • Michiel Stock +
  • Mikolaj Chwalisz +
  • Min RK
  • Miroslav Šedivý +
  • Mykola Golubyev
  • Nate Yoder
  • Nathalie Rud +
  • Nicholas Ver Halen
  • Nick Chmura +
  • Nolan Nichols +
  • Pankaj Pandey +
  • Pawel Kordek
  • Pete Huang +
  • Peter +
  • Peter Csizsek +
  • Petio Petrov +
  • Phil Ruffwind +
  • Pietro Battiston
  • Piotr Chromiec
  • Prasanjit Prakash +
  • Rob Forgione +
  • Robert Bradshaw
  • Robin +
  • Rodolfo Fernandez
  • Roger Thomas
  • Rouz Azari +
  • Sahil Dua
  • Sam Foo +
  • Sami Salonen +
  • Sarah Bird +
  • Sarma Tangirala +
  • Scott Sanderson
  • Sebastian Bank
  • Sebastian Gsänger +
  • Shawn Heide
  • Shyam Saladi +
  • Sinhrks
  • Stephen Rauch +
  • Sébastien de Menten +
  • Tara Adiseshan
  • Thiago Serafim
  • Thoralf Gutierrez +
  • Thrasibule +
  • Tobias Gustafsson +
  • Tom Augspurger
  • Tong SHEN +
  • Tong Shen +
  • TrigonaMinima +
  • Uwe +
  • Wes Turner
  • Wiktor Tomczak +
  • WillAyd
  • Yaroslav Halchenko
  • Yimeng Zhang +
  • abaldenko +
  • adrian-stepien +
  • alexandercbooth +
  • atbd +
  • bastewart +
  • bmagnusson +
  • carlosdanielcsantos +
  • chaimdemulder +
  • chris-b1
  • dickreuter +
  • discort +
  • dr-leo +
  • dubourg
  • dwkenefick +
  • funnycrab +
  • gfyoung
  • goldenbull +
  • hesham.shabana@hotmail.com
  • jojomdt +
  • linebp +
  • manu +
  • manuels +
  • mattip +
  • maxalbert +
  • mcocdawc +
  • nuffe +
  • paul-mannino
  • pbreach +
  • sakkemo +
  • scls19fr
  • sinhrks
  • stijnvanhoey +
  • the-nose-knows +
  • themrmax +
  • tomrod +
  • tzinckgraf
  • wandersoncferreira
  • watercrossing +
  • wcwagner
  • xgdgsc +
  • yui-knk