
AIML - Python - Pandas Library


# adhashdr, newdr, plainwithnewdr, old and new are DataFrames loaded elsewhere.

# Merge the 2018 hashed data with the new data on their indexes (right join).
hashwithnewdr = pd.merge(adhashdr, newdr, left_index=True, right_index=True, how='right', suffixes=('_2018hashed', '_new'))

# Flag rows whose value is unchanged between the two datasets.
hashwithnewdr['result'] = np.where(hashwithnewdr['0_2018hashed'] == hashwithnewdr['0_new'], 'no change', 'changed')

print(plainwithnewdr.loc[plainwithnewdr['result'] == "no change"])
dr4 = plainwithnewdr.loc[plainwithnewdr['result'] == "changed"]

# Count the changed rows that have no 2018 value at all.
dr4 = dr4['0_2018hashed'].isnull().sum()

# Merge the old and new data on the email column (right join).
output = pd.merge(old, new, on='email', how="right", suffixes=('_old', '_new'))


install

  1. install anaconda
# set up environment
$ cd /Users/luo/Documents/code/python3
$ conda create --name PythonEnv python=3.7 -y

$ conda update -n base -c defaults conda

# To activate this environment
$ conda activate PythonEnv

# To deactivate an active environment
$ conda deactivate
  2. install pandas and NumPy
conda install matplotlib -y
conda install pandas -y
conda install numpy -y

import

The Pandas library is usually imported under the alias pd, while the NumPy library is loaded as np.

import numpy as np
import pandas as pd

Pandas DataFrames

A DataFrame can contain data that is:

  • a Pandas Series Object:
    • a 1-dimensional labeled array
    • capable of holding any data type, with axis labels (an index)
    • an example of a Series object is one column of a DataFrame
  • a Pandas DataFrame
    • a 2-dimensional labeled data structure
    • with columns of potentially different types
  • a Pandas Panel
    • a 3-dimensional labeled data container (removed from newer pandas versions)
  • a NumPy ndarray
    • can be a record array or a structured array
  • a two-dimensional ndarray
  • dictionaries of one-dimensional ndarrays, lists, dictionaries or Series

series

# =========================== Series
data = [1,2,3,4]
series1 = pd.Series(data)
series1
# 0    1
# 1    2
# 2    3
# 3    4

data = [[1,2,3,4],['a','b']]
series1 = pd.Series(data)
print(series1)
# 0    [1, 2, 3, 4]
# 1          [a, b]

print(type(series1))
# <class 'pandas.core.series.Series'>


data = [1,2,3]
series1 = pd.Series(data, index=['a','b','c'])
print(series1)
# a    1
# b    2
# c    3
print(series1['a':'b'])
# a    1
# b    2

np.ndarray and np.array()

The difference between np.ndarray and np.array() (see the sketch below):

  • np.ndarray is the actual array data type
  • np.array() is a factory function that makes arrays from other data structures
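A minimal sketch of that difference:

import numpy as np

arr = np.array([1, 2, 3])            # np.array() builds an array from an existing sequence
print(type(arr))                     # <class 'numpy.ndarray'>
print(isinstance(arr, np.ndarray))   # True

# calling the np.ndarray constructor directly is possible but unusual:
# it allocates an array of the given shape without initialising the values
raw = np.ndarray(shape=(2, 2))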

Structured arrays & Record arrays

  • Structured arrays
    • allow users to manipulate the data by named fields
  • Record arrays
    • expand the properties of structured arrays and allow users to access the fields of a structured array by attribute rather than by index
    • below, the foo values are accessed as an attribute of the my_array2 record array

Example: a structured array of three tuples

  • the first element of each tuple is called foo and is of type int
  • the second element is named bar and is a float
# ----------------- A structured array: manipulate the data by named fields
my_array = np.ones(3, dtype=([('foo', int), ('bar', float)]))
print(my_array['foo'])
# [1 1 1]
print(my_array['bar'])
# [ 1.  1.  1.]

# ----------------- A record array: access fields of structured arrays by attribute
my_array2 = my_array.view(np.recarray)
print(my_array2.foo)
# [1 1 1]

specify the index and column names

  • the index labels the rows
  • the columns label the columns (see the small sketch below)
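A minimal sketch, with row and column labels chosen here just for illustration:

df = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                  index=['row1', 'row2'],      # labels the rows
                  columns=['col1', 'col2'])    # labels the columns
#       col1  col2
# row1     1     2
# row2     3     4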

DataFrame

# =========================== DataFrame
data = {'Country': ['Belgium',  'India',  'Brazil'],
        'Capital': ['Brussels',  'New Delhi',  'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country',  'Capital',  'Population'])
#    Country    Capital  Population
# 0  Belgium   Brussels    11190846
# 1    India  New Delhi  1303171035
# 2   Brazil   Brasilia   207847528

Many things can serve as input when making a DataFrame:

  1. Take array as input
data = [[1,2,3,4,5], [1,2,3]]
df = pd.DataFrame(data)
#   0	1	2	3	  4
# 0	1	2	3	4.0	5.0
# 1	1	2	3	NaN	NaN

data = [[1,2]]
#   0	1
# 0	1	2

data = [1,2]
#   0
# 0	1
# 1	2

df.rename(index = {0:"a", 1:"b"}, inplace=True)
df.rename(index = str, columns={0:"A"}, inplace=True)
# 	A
# a	1
# b	2


df.columns=["a"]
# 	a
# a	1
# b	2
  2. Take a NumPy ndarray as input
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
print(pd.DataFrame(my_2darray))
#        0  1  2
#     0  1  2  3
#     1  4  5  6


data = np.array([['','Col1','Col2'], ['Row1',1,2], ['Row2',3,4]])
df = pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])
#          Col1 Col2 (columns)
#     Row1    1    2
#     Row2    3    4
#   (index)


data = np.array([['a','b'], [6,12]])
df = pd.DataFrame( { 'A':data[:,0], 'B':data[:,1] } )
# 	A	B
# 0	a	b
# 1	6	12
  3. Take a dictionary as input
# every key's list must have the same length, otherwise an error is raised
# the dictionary keys become the columns

my_dict = {1: ['1', '3'], 2: ['1', '2'], 3: ['2', '4']}
print(pd.DataFrame(my_dict))
#       1  2  3
#    0  1  1  2
#    1  3  2  4


data = {"a": ['apple', 'mango'], "b": [10,12]}
df = pd.DataFrame(data)
df["FruitName"] = df["a"]
#        a   b  FruitName
# 0  apple  10      apple
# 1  mango  12      mango

  4. Take a DataFrame as input
my_df = pd.DataFrame(data=[4,5,6,7], index=range(0,4), columns=['A'])
print(pd.DataFrame(my_df))
    #    A
    # 0  4
    # 1  5
    # 2  6
    # 3  7
  5. Take a Series as input
my_series = pd.Series({"United Kingdom":"London", "India":"New Delhi", "United States":"Washington", "Belgium":"Brussels"})
print(pd.DataFrame(my_series))
    #                          0
    # Belgium           Brussels
    # India            New Delhi
    # United Kingdom      London
    # United States   Washington

data = pd.Series([6,12], index=['a','b'])
df = pd.DataFrame([data])
#   a	b
# 0	6	12
df = pd.DataFrame(data)
#   0
# a	6
# b	12

The index of the Series (and of the resulting DataFrame) contains the keys of the original dictionary; in the output above they appear sorted alphabetically.

.shape and len() in combination with the index

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))

print(df)
    #    0  1  2
    # 0  1  2  3
    # 1  4  5  6

df.shape       # (2, 3)

len(df.index)  # 2

df[0].count()  # 2
# but .count() excludes NaN values (if there are any),
# so it can differ from len(df.index); see the example after the list below.

df.columns.values    # [0, 1, 2]
  • .shape:
    • the dimensions of the DataFrame
    • i.e. the width and the height of the DataFrame
  • len() in combination with the index:
    • only gives you information on the height (number of rows) of the DataFrame (example below)
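A quick illustration of the difference when NaN values are present (the values below are assumed just for illustration):

df_nan = pd.DataFrame({'A': [1, np.nan, 3]})

len(df_nan.index)     # 3 -- number of rows, NaN included
df_nan['A'].count()   # 2 -- .count() excludes the NaN value
df_nan.count()        # per-column non-NA counts:  A    2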

Fundamental DataFrame Operations

Select Index or Column From a Pandas DataFrame

# -    A  B  C
# - 0  1  2  3
# - 1  4  5  6
# - 2  7  8  9


# access the value that is at index 0, in column ‘A’.
# Using `iloc[]`
print(df.iloc[0][0])

# Using `loc[]`
print(df.loc[0]['A'])

# Using `iat[]`
print(df.iat[0,0])

# Using `at[]`
print(df.at[0,'A'])



# Use `iloc[]` to select a row
print(df.iloc[0])
    # A    1
    # B    2
    # C    3

# Use `loc[]` to select a column
print(df.loc[:,'A'])
    # 0    1
    # 1    4
    # 2    7

Add Index, Row or Column to a Pandas DataFrame

Add Index

# Print out DataFrame `df` to check it out
print(df)

# Set 'C' as the index of the DataFrame
# (set_index returns a new DataFrame; assign the result or pass inplace=True to keep it)
df.set_index('C')

Add Rows

  • .loc[]
    • works on the labels of the index.
    • loc[2]: the values that have an index labeled 2.
  • .iloc[]
    • works on the positions in the index.
    • iloc[2]: the values that are at position 2.
  • .ix[]
    • when the index is integer-based, you pass a label to .ix[].
    • ix[2]: looks in the DataFrame for values that have an index labeled 2.
    • if the index is solely integer-based, .ix[] behaves like .loc[];
    • if the index is not solely integer-based, .ix[] works with positions, just like .iloc[].
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index= [2, 'A', 4], columns=[48, 49, 50])
    #    48  49  50
    # 2   1   2   3
    # A   4   5   6
    # 4   7   8   9

# Pass `2` to `loc`
print(df.loc[2])
    # 48    1
    # 49    2
    # 50    3

# Pass `2` to `iloc`
print(df.iloc[2])
    # 48    7
    # 49    8
    # 50    9

# Pass `2` to `ix`
print(df.ix[2])
    # 48    7
    # 49    8
    # 50    9

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index= [2.5, 12.6, 4.8], columns=[48, 49, 50])
    #       48  49  50
    # 2.5    1   2   3
    # 12.6   4   5   6
    # 4.8    7   8   9


# no index labeled `2`, so .ix falls back to position and changes the row at position `2`
df.ix[2] = [60, 50, 40]
    #       48  49  50
    # 2.5    1   2   3
    # 12.6   4   5   6
    # 4.8   60  50  40


# make an index labeled `2` and add the new values
df.loc[2] = [11, 12, 13]
    #       48  49  50
    # 2.5    1   2   3
    # 12.6   4   5   6
    # 4.8   60  50  40
    # 2.0   11  12  13

Column

Add Column

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=['1st', '2nd','3rd'], columns=['A', 'B', 'C'])
    #      A  B  C
    # 1st  1  2  3
    # 2nd  4  5  6
    # 3rd  7  8  9


# ================================= Add Column
df['D'] = df.index
    #      A  B  C    D
    # 1st  1  2  3  1st
    # 2nd  4  5  6  2nd
    # 3rd  7  8  9  3rd


# ================================= Add Column value
# (each of the snippets below starts again from the df above, with columns A, B, C, D)

df.loc[:, 4] = pd.Series(['5', '6','7'], index=df.index)  # .loc[:, x] looks up by label; column 4 does not exist, so it is added
#      A  B  C    D  4
# 1st  1  2  3  1st  5
# 2nd  4  5  6  2nd  6
# 3rd  7  8  9  3rd  7


df.loc[:, 3] = pd.Series(['5', '6','7'], index=df.index)  # not exist, add
#      A  B  C    D  3
# 1st  1  2  3  1st  5
# 2nd  4  5  6  2nd  6
# 3rd  7  8  9  3rd  7


df.loc[:, 'D'] = pd.Series(['5', '6','7'], index=df.index) # exist, change
#      A  B  C  D
# 1st  1  2  3  5
# 2nd  4  5  6  6
# 3rd  7  8  9  7


df.iloc[:, 3] = pd.Series(['5', '6','7'], index=df.index)  # .iloc[:, x] looks up by position; position 3 is column D
#      A  B  C  D
# 1st  1  2  3  5
# 2nd  4  5  6  6
# 3rd  7  8  9  7

A Series object is much like a single column of a DataFrame. For example:
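Selecting one column from the df above returns a Series:

col = df['A']
print(type(col))
# <class 'pandas.core.series.Series'>
print(col)
# 1st    1
# 2nd    4
# 3rd    7
# Name: A, dtype: int64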

change the columns

data = [1,2]

df = pd.DataFrame(data)

# 1.
df.rename(index = str, columns={0:"A"}, inplace=True)

# 2.
df.columns=["a"]

# 3.
coList=[]
for col in df.columns:
  if col == "a":
    col = "b"
  coList.append(col)
df.columns = coList

Reset the Index

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index= [2.5, 12.6, 4.8], columns=[48, 49, 50])
    #       48  49  50
    # 2.5    1   2   3
    # 12.6   4   5   6
    # 4.8    7   8   9


# reset the index (drop=True discards the old float index)
df = df.reset_index(level=0, drop=True)
    #    48  49  50
    # 0   1   2   3
    # 1   4   5   6
    # 2   7   8   9

df_reset = df.reset_index(level=0, inplace=True)
    # None  (inplace=True modifies df directly and returns None)

The drop argument gets rid of the index that was there; without drop=True, the original float index is added back as an extra column of the DataFrame. With inplace=True the DataFrame is modified in place and the call returns None. A short sketch of the default drop=False behaviour follows.
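Using the same float-indexed DataFrame as above:

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), index=[2.5, 12.6, 4.8], columns=[48, 49, 50])

df.reset_index()   # drop defaults to False: the old index is kept as a column
#    index  48  49  50
# 0    2.5   1   2   3
# 1   12.6   4   5   6
# 2    4.8   7   8   9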


Merge

(diagrams illustrating the different merge/join types)

pandas.DataFrame.merge

inner

In the diagrams: blue indicates rows that are present in the merge result; red indicates rows that are excluded from the result (i.e., removed); green indicates missing values that are replaced with NaNs in the result.

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
# left
#   key     value
# 0   A  1.764052
# 1   B  0.400157
# 2   C  0.978738
# 3   D  2.240893
# right
#   key     value
# 0   B  1.867558
# 1   D -0.977278
# 2   E  0.950088
# 3   F -0.151357

left.merge(right, on='key')
left.merge(right, on='key', how='inner')
#   key   value_x   value_y
# 0   B  0.400157  1.867558
# 1   D  2.240893 -0.977278
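The other how values behave as described in the legend above: an outer join keeps every key from both frames and fills the gaps with NaN, while a left join keeps all keys from the left frame. A quick sketch using the same left/right frames (the value columns hold random numbers, so only the keys are noted):

left.merge(right, on='key', how='outer')
# keys: A, B, C, D, E, F -- unmatched rows get NaN in value_x or value_y

left.merge(right, on='key', how='left')
# keys: A, B, C, D -- value_y is NaN for A and C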

example

data1 = [["py1","NaN", 8, "NaN", "Game1", "NaN"], ["py2","None", 9, "NaN", "Game2", "NaN"]]
data2 = [["NaN", "py5", "NaN", "Kick", "NaN", "Game5"], ["NaN", "py6", "NaN", "Elbow", "NaN", "Game6"]]

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
#      0     1  2    3      4    5
# 0  py1   NaN  8  NaN  Game1  NaN
# 1  py2  None  9  NaN  Game2  NaN
#      0    1    2      3    4      5
# 0  NaN  py5  NaN   Kick  NaN  Game5
# 1  NaN  py6  NaN  Elbow  NaN  Game6

pd.concat([df1, df2])
#   0	  1	    2	  3	    4	    5
# 0	py1	NaN	  8	  NaN	  Game1	NaN
# 1	py2	None	9	  NaN	  Game2	NaN
# 0	NaN	py5	  NaN	Kick	NaN	  Game5
# 1	NaN	py6	  NaN	Elbow	NaN	  Game6


data1 = ({"A":[1,2,3], "B":[4,5,6],"C":[1,2,3], "D":[4,5,6]})
data2 = ({"A":[3,8,9], "C":[1,0,1],"E":[7,8,9], "F":[1,0,1]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)

pd.concat([df1, df2])
#   A	B	  C	D	  E	  F
# 0	1	4.0	1	4.0	NaN	NaN
# 1	2	5.0	2	5.0	NaN	NaN
# 2	3	6.0	3	6.0	NaN	NaN
# 0	3	NaN	1	NaN	7.0	1.0
# 1	8	NaN	0	NaN	8.0	0.0
# 2	9	NaN	1	NaN	9.0	1.0


# axis{0/’index’, 1/’columns’}, default 0
# The axis to concatenate along.

# keep only the columns that all DataFrames have in common
pd.concat([df1, df2], axis=0, join="inner")
# 	A	C
# 0	1	1
# 1	2	2
# 2	3	3
# 0	3	1
# 1	8	0
# 2	9	1
pd.concat([df1, df2], axis=1, join="inner")
# 	A	B	C	D	A	C	E	F
# 0	1	4	1	4	3	1	7	1
# 1	2	5	2	5	8	0	8	0
# 2	3	6	3	6	9	1	9	1

pandas.DataFrame.merge

# how: {'left', 'right', 'outer', 'inner'}, default 'inner'
# on: label or list. Column or index level names to join on. These must be found
#     in both DataFrames. If on is None and you are not merging on indexes, this
#     defaults to the intersection of the columns of both DataFrames.

data1 = {"A": [1, 2, 3], "B": [4, 5, 6]}
data2 = {"A": [3, 8, 9], "C": [1, 0, 1]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6
#    A  C
# 0  3  1
# 1  8  0
# 2  9  1

pd.merge(df1, df2, on=["A"], how="inner")
# 	A	B	C
# 0	3	6	1

# the merge key must exist in both DataFrames, so redefine df2 with a B column
df2 = pd.DataFrame({"A": [3, 8, 9], "B": [1, 5, 1]})

pd.merge(df1, df2, on=["B"], how="inner")
# 	A_x	B	A_y
# 0	  2	5	8

pd.merge(df1, df2, on=["B"], how="inner", suffixes=('_left', '_right'))
# 	A_left	B	A_right
# 0	     2	5	8


df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
#   lkey  value
# 0  foo      1
# 1  bar      2
# 2  baz      3
# 3  foo      5
#   rkey  value
# 0  foo      5
# 1  bar      6
# 2  baz      7
# 3  foo      8

df1.merge(df2, left_on='lkey', right_on='rkey')
#   lkey  value_x rkey  value_y
# 0  foo        1  foo        5
# ...

I/O

# Read and Write to CSV
pd.read_csv('file.csv', header=None, nrows=5)
df.to_csv('myDataFrame.csv')

data = pd.read_csv('/Users/luo/Documents/code/python3/jupyter/passwd/new/Post3.txt',sep=":", header=None)
data.columns = ["a", "b", "c", "etc."]


# Read multiple sheets from the same file
xlsx = pd.ExcelFile('file.xls')
df = pd.read_excel(xlsx, 'Sheet1')

# Read and Write to Excel
pd.read_excel('file.xlsx')
df.to_excel('dir/myDataFrame.xlsx',  sheet_name='Sheet1')

# Read and Write to SQL Query or Database Table
# (read_sql()is a convenience wrapper around read_sql_table() and read_sql_query())

from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
pd.read_sql("SELECT * FROM my_table;", engine)
pd.read_sql_table('my_table', engine)
pd.read_sql_query("SELECT * FROM my_table;", engine)
df.to_sql('myDf', engine)


Selection


get

import numpy as np
import pandas as pd

data = {'Country': ['Belgium',  'India',  'Brazil'],
        'Capital': ['Brussels',  'New Delhi',  'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country',  'Capital',  'Population'])

# .  | Country | Capital   | Population
# 0  | Belgium | Brussels  | 11190846
# 1  | India   | New Delhi | 1303171035
# 2  | Brazil  | Brasilia  | 207847528


# Get one column (returned as a Series)
df['Capital']
# 0     Brussels
# 1    New Delhi
# 2     Brasilia

# Get a subset of the DataFrame
df[1:]
#   Country    Capital  Population
# 1   India  New Delhi  1303171035
# 2  Brazil   Brasilia   207847528

Selecting, Boolean Indexing and Setting

# By Position
# Select a single value by row and column
>>> df.iloc[0, 0]
'Belgium'
>>> df.iat[0, 0]
'Belgium'

# ===============================================================

# By Label
# Select a single value by row and column labels
>>> df.loc[0, 'Country']
'Belgium'
>>> df.at[0, 'Country']
'Belgium'

# ===============================================================

# By Label/Position

# Select a single row or a subset of rows
>>> df.ix[2]
Country        Brazil
Capital      Brasilia
Population  207847528

# Select a single column or a subset of columns
>>> df.ix[:, 'Capital']
0     Brussels
1    New Delhi
2     Brasilia

# Select rows and columns
>>> df.ix[1, 'Capital']
'New Delhi'
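Note that .ix has since been removed from pandas; the same selections can be written with .loc and .iloc, for example:

>>> df.iloc[2]                # single row by position
>>> df.loc[:, 'Capital']      # single column by label
>>> df.loc[1, 'Capital']      # row 1, column 'Capital'
'New Delhi'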

Boolean Indexing

# Series s where value is not >1
>>> s[~(s > 1)]

# s where value is <-1 or >2
>>> s[(s < -1) | (s > 2)]

# Use filter to adjust DataFrame
>>> df[df['Population']>1200000000]


# Setting
# Set index a of Series s to 6
>>> s['a'] = 6

Dropping

#    Country    Capital  Population
# 0  Belgium   Brussels    11190846
# 1    India  New Delhi  1303171035
# 2   Brazil   Brasilia   207847528


# Drop values from rows (axis=0)
>>> s.drop(['a',  'c'])

a = df.drop([2])
#    Country    Capital  Population
# 0  Belgium   Brussels    11190846
# 1    India  New Delhi  1303171035



# Drop values from columns(axis=1)
>>> df.drop('Country', axis=1)


# Sort and Rank
# Sort by labels along an axis
>>> df.sort_index()

# Sort by the values along an axis
>>> df.sort_values(by='Country')

# Assign ranks to entries (example below)
>>> df.rank()
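A small example of what .rank() returns, using the Population column of the df above:

>>> df['Population'].rank()
0    1.0
1    3.0
2    2.0
Name: Population, dtype: float64
# Belgium has the smallest population (rank 1), India the largest (rank 3)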

Retrieving Series/DataFrame Information

Basic Information

# (rows, columns)
>>> df.shape

# Describe index
>>> df.index

# Describe DataFrame columns
>>> df.columns

# Info on DataFrame
>>> df.info()

# Number of non-NA values
>>> df.count()

Summary

# Sum of values
>>> df.sum()

# Cumulative sum of values
>>> df.cumsum()

# Minimum/maximum values
>>> df.min()/df.max()

# Minimum/Maximum index value
>>> df.idxmin()/df.idxmax()

# Summary statistics
>>> df.describe()

# Mean of values
>>> df.mean()

# Median of values
>>> df.median()


This post is licensed under CC BY 4.0 by the author.
