[Python Machine Learning Complete Guide, Chapter 1] 4. Pandas

hyeonzzz's Tech Blog

Machine Learning

hyeonzzz 2024. 1. 4. 15:20

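1. Understanding Python-Based Machine Learning and Its Ecosystem - Pandas

1) Data handling - Pandas

DataFrame : a data structure that holds two-dimensional data made up of multiple rows and columns; it can have multiple columns

Index : the key value that uniquely identifies each record

Series : a data structure with only a single column

A DataFrame is made up of multiple Series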
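As a minimal sketch of this relationship (the tiny `df` below is made-up illustration data, not the Titanic set), selecting one column from a DataFrame yields a Series that shares the DataFrame's Index:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [22, 38, 26]})

name_col = df['Name']                    # a single column comes back as a Series
print(type(df))
print(type(name_col))
print(name_col.index.equals(df.index))   # the Series reuses the DataFrame's Index
```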

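2) Getting started with Pandas - loading a file into a DataFrame, basic API

read_table() : the default field delimiter is a tab ('\t'), so it is equivalent to read_csv('filename', sep='\t')

read_csv() : the default field delimiter is a comma (',')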
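The equivalence of the two readers can be checked without any file on disk. The sketch below feeds a made-up tab-separated string (`tsv_text`, an assumption for illustration) to both functions through `io.StringIO`:

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk.
tsv_text = "id\tname\n1\tkim\n2\tlee\n"

df_table = pd.read_table(io.StringIO(tsv_text))         # default sep is '\t'
df_csv = pd.read_csv(io.StringIO(tsv_text), sep='\t')   # same parse with an explicit sep

print(df_table.equals(df_csv))
print(list(df_table.columns))
```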

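Passing the file path

import pandas as pd

titanic_df = pd.read_csv('titanic_train.csv')
print('titanic variable type:',type(titanic_df))
titanic_df

If the Jupyter notebook and the CSV file are in the same directory, the file name alone is enough: read_csv('titanic_train.csv')

pd.read_csv

- Loads the file and returns it as a DataFrame object

- Treats the first row as the column names and turns it into the columns

- The DataFrame gets its own Index values the moment it is created

Displaying part of the data

titanic_df.head(3)

Row and column size - shape

print('DataFrame size: ', titanic_df.shape)

DataFrame size:  (891, 12)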

info( )

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890    - total number of rows
Data columns (total 12 columns):        - total number of columns
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object         - data type of each column; object usually means string
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64        - 714 values are non-null, so 177 are Null
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)        - summary of the types of all 12 columns
memory usage: 83.7+ KB

describe( )

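By default, summarizes the distribution of numeric columns only

Columns of object type are automatically excluded from the output

titanic_df.describe()

count - number of non-null values

mean - mean of all values

std - standard deviation

min - minimum value

max - maximum value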
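A small sketch of these rules, using a made-up two-column frame rather than the Titanic data: the object column disappears from the summary, and the missing value is left out of count and mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one object column and one numeric column with a missing value.
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Age': [10.0, 20.0, np.nan, 30.0]})

summary = df.describe()
print(summary)

print(list(summary.columns))         # only the numeric column remains
print(summary.loc['count', 'Age'])   # the missing value is not counted
print(summary.loc['mean', 'Age'])    # mean of the three non-null values
```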

value_counts( )

Shows the distinct values in the column and how many times each occurs

The result is returned sorted by count, largest first

value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)

3    491
1    216
2    184
Name: Pclass, dtype: int64

Series

A Series is a data set made up of an index and exactly one column

titanic_pclass = titanic_df['Pclass']
titanic_pclass.head()

0    3         - 0 is the index, identical to the DataFrame's index
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

value_counts( ) can also be called on a DataFrame object, but calling it on a single-column Series makes the distribution of values easier to read

value_counts = titanic_df['Pclass'].value_counts()
print(type(value_counts))
print(value_counts)

<class 'pandas.core.series.Series'>
3    491
1    216
2    184
Name: Pclass, dtype: int64

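The value returned by value_counts( ) is also a Series object. Its index, however, is 3, 1, 2 (the distinct Pclass values) rather than 0, 1, 2, ...
As this shows, the index can hold meaningful values as long as they uniquely identify each entry.
The index does not have to be numeric; string indexes are possible too.
By default, value_counts( ) ignores Null values and counts only the actual values.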

print('titanic_df row count:', titanic_df.shape[0])
print('value_counts() with the default dropna=True')
# value_counts() defaults to dropna=True, so this is the same as value_counts(dropna=True).
print(titanic_df['Embarked'].value_counts())
print(titanic_df['Embarked'].value_counts(dropna=False))

titanic_df row count: 891
value_counts() with the default dropna=True
S    644
C    168
Q     77
Name: Embarked, dtype: int64
S      644
C      168
Q       77
NaN      2          - with dropna=False, Null values are counted as well
Name: Embarked, dtype: int64
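The same behaviour can be reproduced without the Titanic file. In this sketch the `ports` Series is made-up data; it shows both the string-valued index of the result and the effect of `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with string values and one missing entry.
ports = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'])

counts = ports.value_counts()                        # dropna=True by default
counts_with_nan = ports.value_counts(dropna=False)   # NaN gets its own row

print(counts)
print(counts_with_nan)
print(list(counts.index))   # the index holds the distinct string values
```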

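3) Converting between DataFrame and list, dictionary, and NumPy ndarray

DataFrame

A DataFrame can be created from many kinds of data, such as lists, dictionaries, and NumPy ndarrays, and can be converted back into them

Conversion between DataFrame and NumPy ndarray happens very frequently

Converting a NumPy ndarray, list, or dictionary to a DataFrame

Unlike a list or a NumPy ndarray, a DataFrame has column names
The column names make the data easier to handle
Specify the column names when converting to a DataFrame
Because a DataFrame is two-dimensional, only data of two dimensions or fewer can be converted to a DataFrame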
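The two-dimension limit can be seen directly. In this sketch (array names are illustrative), a 2-D ndarray converts cleanly while a 3-D ndarray is rejected with a ValueError:

```python
import numpy as np
import pandas as pd

array_2d = np.arange(6).reshape(2, 3)       # 2 dimensions: accepted
array_3d = np.arange(8).reshape(2, 2, 2)    # 3 dimensions: rejected

df_ok = pd.DataFrame(array_2d, columns=['col1', 'col2', 'col3'])
print(df_ok.shape)

try:
    pd.DataFrame(array_3d)
    converted = True
except ValueError as err:
    converted = False
    print('ValueError:', err)
```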

Converting 1-D data to a DataFrame

The data is one-dimensional, so only one column is needed

The column name is set to 'col1'

import numpy as np
import pandas as pd

col_name1=['col1']
list1 = [1, 2, 3]
array1 = np.array(list1)
print('array1 shape:', array1.shape )
# Create a DataFrame from a list.
df_list1 = pd.DataFrame(list1, columns=col_name1)
print('DataFrame from a 1-D list:\n', df_list1)
# Create a DataFrame from a NumPy ndarray.
df_array1 = pd.DataFrame(array1, columns=col_name1)
print('DataFrame from a 1-D ndarray:\n', df_array1)

array1 shape: (3,)
DataFrame from a 1-D list:
    col1
0     1
1     2
2     3
DataFrame from a 1-D ndarray:
    col1
0     1
1     2
2     3

Converting 2-D data to a DataFrame

The list and ndarray have 2 rows and 3 columns, so three column names are needed

# Three column names are needed.
col_name2=['col1', 'col2', 'col3']

# Create a 2-row x 3-column list and ndarray, then convert them to DataFrames.
list2 = [[1, 2, 3],
         [11, 12, 13]]
array2 = np.array(list2)
print('array2 shape:', array2.shape )
df_list2 = pd.DataFrame(list2, columns=col_name2)
print('DataFrame from a 2-D list:\n', df_list2)
df_array2 = pd.DataFrame(array2, columns=col_name2)
print('DataFrame from a 2-D ndarray:\n', df_array2)

array2 shape: (2, 3)
DataFrame from a 2-D list:
    col1  col2  col3
0     1     2     3
1    11    12    13
DataFrame from a 2-D ndarray:
    col1  col2  col3
0     1     2     3
1    11    12    13

Converting a dictionary to a DataFrame

Key (string) → column name

Value (list or ndarray) → column data

# Each key maps to a column name; each value is a list (or ndarray) of column data
dict1 = {'col1':[1, 11], 'col2':[2, 22], 'col3':[3, 33]}
df_dict = pd.DataFrame(dict1)
print('DataFrame from a dictionary:\n', df_dict)

DataFrame from a dictionary:
    col1  col2  col3
0     1     2     3
1    11    22    33

Converting a DataFrame to a NumPy ndarray, list, or dictionary

values

Converts a DataFrame to a NumPy ndarray

# Convert the DataFrame to an ndarray
array3 = df_dict.values
print('df_dict.values type:', type(array3), 'df_dict.values shape:', array3.shape)
print(array3)

df_dict.values type: <class 'numpy.ndarray'> df_dict.values shape: (2, 3)
[[ 1  2  3]
 [11 22 33]]

tolist( )

Converts the ndarray obtained from values to a list (a DataFrame has no tolist( ) method of its own)

to_dict( )

Converts a DataFrame to a dictionary; passing 'list' returns each column's values as a list

# Convert the DataFrame to a list
list3 = df_dict.values.tolist()
print('df_dict.values.tolist() type:', type(list3))
print(list3)

# Convert the DataFrame to a dictionary
dict3 = df_dict.to_dict('list')
print('\n df_dict.to_dict() type:', type(dict3))
print(dict3)

df_dict.values.tolist() type: <class 'list'>
[[1, 2, 3], [11, 22, 33]]

 df_dict.to_dict() type: <class 'dict'>
{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}
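The argument passed to to_dict( ) controls the shape of the result. As a rough sketch on a made-up two-column frame, the `orient` values 'list', 'dict' (the default), and 'records' compare as follows:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 11], 'col2': [2, 22]})

as_lists = df.to_dict(orient='list')       # {column: [values]}
as_nested = df.to_dict()                   # default orient='dict': {column: {index: value}}
as_records = df.to_dict(orient='records')  # one dictionary per row

print(as_lists)
print(as_nested)
print(as_records)
```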

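4) Creating and modifying DataFrame column data

Adding an Age_0 column and assigning 0 to it

titanic_df['Age_0']=0
titanic_df.head(3)

Use the [ ] operator: assigning a scalar to a new column name creates the column and fills every row with that value

Creating new column Series from existing Series data

titanic_df['Age_by_10'] = titanic_df['Age']*10
titanic_df['Family_No'] = titanic_df['SibSp'] + titanic_df['Parch']+1
titanic_df.head(3)

Modifying the values of an existing column

titanic_df['Age_by_10'] = titanic_df['Age_by_10'] + 100
titanic_df.head(3)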

5) Deleting DataFrame data

drop( )

Important parameters

labels
axis

- Drops the specified columns or rows

- axis = 1 means the labels are columns to drop

- axis = 0 drops rows; it is mostly used to delete outlier records

titanic_drop_df = titanic_df.drop('Age_0', axis=1 )
titanic_drop_df.head(3)

However, the 'Age_0' column still exists in the original Titanic DataFrame, because inplace=False is the default!

inplace

Setting inplace=True deletes the data from the original DataFrame

Deleting the three columns 'Age_0', 'Age_by_10', and 'Family_No'

drop_result = titanic_df.drop(['Age_0', 'Age_by_10', 'Family_No'], axis=1, inplace=True)
print('Value returned by drop with inplace=True:',drop_result)
titanic_df.head(3)

Note that the return value is None. So when inplace=True is set, never assign the return value back to the DataFrame variable itself
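The difference between the two modes, and the pitfall just described, can be sketched on a made-up three-column frame:

```python
import pandas as pd

# Hypothetical toy frame for illustration.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# inplace=False (default): the original is untouched and a new DataFrame is returned.
dropped = df.drop('c', axis=1)
print(list(df.columns), list(dropped.columns))

# inplace=True: the original is modified and None is returned.
result = df.drop('b', axis=1, inplace=True)
print(result, list(df.columns))

# Pitfall: writing `df = df.drop('a', axis=1, inplace=True)` would rebind df to None.
```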

Deleting the rows with index 0, 1, 2 by setting axis=0

pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 15)
print('#### before axis 0 drop ####')
print(titanic_df.head(3))

titanic_df.drop([0,1,2], axis=0, inplace=True)

print('#### after axis 0 drop ####')
print(titanic_df.head(3))

6) Index object

The Index object uniquely identifies each record of a DataFrame or Series

Extracting the Index object

The actual values of an Index object can be viewed as a 1-D NumPy ndarray

# Reload the original file
titanic_df = pd.read_csv('titanic_train.csv')
# Extract the Index object
indexes = titanic_df.index
print(indexes)
# Convert the Index object to an array of its actual values
print('Index object array values:\n',indexes.values)

print(type(indexes.values))
print(indexes.values.shape)
print(indexes[:5].values)
print(indexes.values[:5])
print(indexes[6])

<class 'numpy.ndarray'> - the actual values of the Index object are a 1-D NumPy ndarray
(891,)   - 1-D array
[0 1 2 3 4]   - slicing is supported
[0 1 2 3 4]
6   - a single value can be returned

Caution : an Index object is immutable, so its values cannot be changed once it is created
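A short sketch of what "cannot be changed" means in practice, on a made-up frame: assigning to a single position of the Index raises a TypeError, while replacing the whole Index with a new one is allowed.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})
indexes = df.index

try:
    indexes[0] = 5          # item assignment on an Index is not allowed
    changed = True
except TypeError as err:
    changed = False
    print('TypeError:', err)

# Replacing the whole Index with a new one is allowed.
df.index = ['a', 'b', 'c']
print(df.index.tolist())
```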

series_fare = titanic_df['Fare']
print('Fare Series max value:', series_fare.max())
print('Fare Series sum value:', series_fare.sum())
print('sum() Fare Series:', sum(series_fare))
print('Fare Series + 3:\n',(series_fare + 3).head(3) )

When an operation is applied to a Series object, the Index is excluded from the computation; only the values are used

reset_index( )

titanic_reset_df = titanic_df.reset_index(inplace=False)
titanic_reset_df.head(3)

Calling reset_index( ) assigns a new sequential index, and

the previous index is added as a new column named 'index'

print('### before reset_index ###')
value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)
print('value_counts variable type:',type(value_counts))

new_value_counts = value_counts.reset_index(inplace=False)
print('### After reset_index ###')
print(new_value_counts)
print('new_value_counts variable type:',type(new_value_counts))

### before reset_index ###
3    491
1    216
2    184
Name: Pclass, dtype: int64
value_counts variable type: <class 'pandas.core.series.Series'>
### After reset_index ###
   index  Pclass   - the previous Index is added under the column name 'index'
0      3     491   - there are now two columns, so the result is a DataFrame, not a Series
1      1     216
2      2     184
new_value_counts variable type: <class 'pandas.core.frame.DataFrame'>

Note: from pandas 2.0 onward, value_counts( ) names its result 'count' and keeps 'Pclass' as the index name, so the two columns above appear as 'Pclass' and 'count' instead

Setting drop=True discards the previous index instead of adding it as a column
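The effect of `drop` can be compared side by side. This sketch uses a made-up frame whose first two rows were dropped, so the index starts at 2:

```python
import pandas as pd

# Hypothetical toy frame; rows 0 and 1 are dropped so the remaining index is 2, 3.
df = pd.DataFrame({'x': [10, 20, 30, 40]}).drop([0, 1], axis=0)

kept = df.reset_index()               # the old index becomes an 'index' column
dropped = df.reset_index(drop=True)   # the old index is discarded

print(kept)
print(dropped)
```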

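7) Data selection and filtering

[ ] operator

For now, think of the [ ] that follows a DataFrame as a column-selection operator
A slice that can be interpreted as index positions is also accepted inside [ ] (a single integer such as titanic_df[0] is not)

titanic_df[0:2]

Boolean indexing expressions also work

titanic_df[ titanic_df['Pclass'] == 3].head(3)

iloc[ ], loc[ ]

(data_df in the examples below is a small example DataFrame whose index labels are strings such as 'one' and 'two')

iloc[ ] - position-based indexing

data_df.iloc[0, 0]

print("\n Last column [:, -1] \n", data_df.iloc[:, -1])
print("\n All data except the last column [:, :-1] \n", data_df.iloc[: , :-1])

loc[ ] - label-based indexing

data_df.loc['one', 'Name']

print('Position-based iloc slicing\n', data_df.iloc[0:1, 0],'\n')
print('Label-based loc slicing\n', data_df.loc['one':'two', 'Name'])

Position-based iloc slicing
one    Chulmin
Name: Name, dtype: object

Label-based loc slicing
one     Chulmin
two    Eunkyung
Name: Name, dtype: object

Label-based slicing includes the end value, unlike position-based slicing, which excludes it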
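To make the end-value difference concrete, here is a sketch with a hypothetical stand-in for `data_df`. Only the 'one' and 'two' labels, the 'Name' column, and the first two names come from the output above; the third row and the 'Year' column are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data_df: the 'three' row and the 'Year' column are assumed.
data_df = pd.DataFrame(
    {'Name': ['Chulmin', 'Eunkyung', 'Jinwoong'],
     'Year': [2011, 2016, 2015]},
    index=['one', 'two', 'three'])

by_position = data_df.iloc[0:2, 0]            # positions 0 and 1 (end excluded)
by_label = data_df.loc['one':'two', 'Name']   # labels 'one' through 'two' (end included)

print(by_position.tolist())
print(by_label.tolist())
print(data_df.iloc[0:1, 0].tolist())          # position slice 0:1 returns a single row
```

Both slices return the same two rows even though the position slice ends at 2 and the label slice ends at 'two'.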

Boolean indexing

Supported by both [ ] and loc[ ]

titanic_df = pd.read_csv('titanic_train.csv')
titanic_boolean = titanic_df[titanic_df['Age'] > 60]
print(type(titanic_boolean))
titanic_boolean

Returns every row whose 'Age' value is greater than 60

The returned titanic_boolean object is a DataFrame

titanic_df[titanic_df['Age'] > 60][['Name','Age']].head(3)

Extracts only the desired columns. Because two or more columns are selected, [ [ ] ] is used

titanic_df.loc[titanic_df['Age'] > 60, ['Name','Age']].head(3)

The same selection using loc[ ]

titanic_df[ (titanic_df['Age'] > 60) & (titanic_df['Pclass']==1) & (titanic_df['Sex']=='female')]

cond1 = titanic_df['Age'] > 60
cond2 = titanic_df['Pclass']==1
cond3 = titanic_df['Sex']=='female'
titanic_df[ cond1 & cond2 & cond3]

Using compound conditions: combine them with &; conditions written inline must each be wrapped in parentheses
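The text above only uses `&`. As a sketch on made-up passenger rows, the same pattern extends to `|` (OR) and `~` (NOT), which pandas also supports for boolean Series:

```python
import pandas as pd

# Hypothetical toy passengers.
df = pd.DataFrame({'Age': [70, 65, 30, 62],
                   'Pclass': [1, 3, 1, 1],
                   'Sex': ['female', 'male', 'female', 'male']})

both = df[(df['Age'] > 60) & (df['Pclass'] == 1)]         # AND
either = df[(df['Age'] > 60) | (df['Sex'] == 'female')]   # OR
negated = df[~(df['Age'] > 60)]                           # NOT

print(both.index.tolist(), either.index.tolist(), negated.index.tolist())
```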

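8) Sorting, aggregation functions, and GroupBy

sort_values( )

by - the column(s) to sort by
ascending - ascending (True, the default) or descending (False)
inplace - whether to apply the sort to the original DataFrame (False by default)

titanic_sorted = titanic_df.sort_values(by=['Name'])
titanic_sorted.head(3)

Returns the data sorted by the Name column in ascending order

titanic_sorted = titanic_df.sort_values(by=['Pclass', 'Name'], ascending=False)
titanic_sorted.head(3)

Sorts by Pclass and then Name, both in descending order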
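`ascending` also accepts one flag per sort column. The sketch below, on made-up rows, sorts Pclass in descending order and Name in ascending order within each Pclass:

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [3, 1, 3, 1],
                   'Name': ['Braund', 'Cumings', 'Allen', 'Futrelle']})

# Pclass descending, then Name ascending within each Pclass.
mixed = df.sort_values(by=['Pclass', 'Name'], ascending=[False, True])
print(mixed)
```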

Applying aggregation functions

titanic_df.count()

titanic_df[['Age', 'Fare']].mean()

Age     29.699118
Fare    32.204208
dtype: float64

Applying groupby( )

titanic_groupby = titanic_df.groupby('Pclass').count()
titanic_groupby

The aggregation function is applied to every column except the groupby( ) key column

titanic_df.groupby('Pclass')['Age'].agg(['max', 'min'])

To apply several aggregation functions at once, pass them as a list to agg( )

agg_format={'Age':'max', 'SibSp':'sum', 'Fare':'mean'}
titanic_df.groupby('Pclass').agg(agg_format)

To apply a different aggregation function to each column, pass a dictionary
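Both forms of agg( ) can be checked by hand on a small made-up frame, where the expected group values are easy to compute:

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 1, 3, 3],
                   'Age': [40.0, 60.0, 20.0, 30.0],
                   'Fare': [80.0, 100.0, 8.0, 12.0]})

# Several functions on one column.
age_stats = df.groupby('Pclass')['Age'].agg(['max', 'min'])

# A different function per column.
per_column = df.groupby('Pclass').agg({'Age': 'max', 'Fare': 'mean'})

print(age_stats)
print(per_column)
```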

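9) Handling missing data

Missing data : a column has no value, i.e. it is NULL (NaN)

Most machine learning algorithms do not accept NaN values, so they must be replaced with other values

isna( ) - checking for NaN

titanic_df.isna().head(3)

Reports True or False for each value depending on whether it is NaN

isna( ).sum( ) - number of missing values

titanic_df.isna( ).sum( )

In sum( ), True is treated as 1 and False as 0, so the result is the number of NaN values per column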
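A minimal sketch of this counting on a made-up frame with known gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 26.0, np.nan],
                   'Cabin': ['C85', None, None, None]})

mask = df.isna()                  # True where a value is missing
missing_per_column = mask.sum()   # True counts as 1, False as 0

print(mask)
print(missing_per_column)
```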

fillna( ) - replacing missing data

titanic_df['Cabin'] = titanic_df['Cabin'].fillna('C000')
titanic_df.head(3)

Replaces the NaN values in the Cabin column with 'C000'

The underlying data set only changes if you assign the value returned by fillna( ) back, or pass inplace=True!
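The sketch below, on a made-up column, shows that calling fillna( ) without using its return value leaves the data unchanged, while assigning the result back updates it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cabin': ['C85', np.nan, 'E46']})

df['Cabin'].fillna('C000')                  # return value discarded: df is unchanged
missing_before = int(df['Cabin'].isna().sum())

df['Cabin'] = df['Cabin'].fillna('C000')    # assigned back: df is updated
missing_after = int(df['Cabin'].isna().sum())

print(missing_before, missing_after)
```

Assigning the result back is used here because it reads the same way regardless of how a given pandas version treats in-place changes on a selected column.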

titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
titanic_df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

The NaN values in the Age column are replaced with the mean age, and those in the Embarked column with 'S'

10) Processing data with apply and lambda expressions

When to use apply with a lambda - when the data needs more complex processing

titanic_df['Name_len']= titanic_df['Name'].apply(lambda x : len(x))
titanic_df[['Name','Name_len']].head(3)

titanic_df['Child_Adult'] = titanic_df['Age'].apply(lambda x : 'Child' if x <=15 else 'Adult' )
titanic_df[['Age','Child_Adult']].head(8)

When using if else in a lambda, note that the return value must be written before the if expression

titanic_df['Age_cat'] = titanic_df['Age'].apply(
    lambda x : 'Child' if x<=15 else ('Adult' if x <= 60 else 'Elderly'))
titanic_df['Age_cat'].value_counts()

A lambda does not support else if (elif), so a further if else is nested inside ( )

def get_category(age):
    cat = ''
    if age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else : cat = 'Elderly'

    return cat

# Use the get_category( ) function defined above as the return value of the lambda expression.
# get_category(x) takes an 'Age' column value as input and returns the matching category
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
titanic_df[['Age','Age_cat']].head()

Defining a separate function makes a more fine-grained classification possible
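As an alternative to the apply approach above (a different technique, not the one used in this post), pandas also provides `pd.cut` for binning numeric values. The sketch below assumes the same age boundaries as get_category( ); the `ages` Series is made-up data.

```python
import numpy as np
import pandas as pd

ages = pd.Series([3, 12, 18, 30, 61, np.nan])

# Right-closed bins (a, b] mirror the `age <= b` checks in get_category( ).
bins = [-np.inf, 5, 12, 18, 25, 35, 60, np.inf]
labels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']

age_cat = pd.cut(ages, bins=bins, labels=labels)
print(age_cat.tolist())
```

One difference worth noting: `pd.cut` leaves a missing age as missing, whereas get_category( ) falls through to 'Elderly' for NaN because every comparison with NaN is False. In this post the Age column was already filled with the mean, so the two approaches agree there.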
