
231221 THU Python Review (1): Basics of Python Data Analysis and Visualization

행팽 2023. 12. 21. 13:10

0. References

 

(1) matplotlib API Reference

(2) WikiDocs (위키독스)

 

 

1. Colab Tips

 

(1) Opening an Excel file with pandas

import pandas as pd
titanic = pd.read_excel('filename.xlsx', engine='openpyxl')

 

(2) Opening a file by its path

titanic = pd.read_table('file_path', sep=',')   # same result as pd.read_csv('file_path')
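
As a rough check of why sep=',' is needed here, the sketch below reads the same comma-separated text with both read_table and read_csv. The in-memory text and its values are made up for illustration; a real notebook would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file path.
csv_text = "Survived,Pclass,Fare\n1,1,71.28\n0,3,7.25\n"

# read_table defaults to tab-separated input, so sep=',' is needed for CSV.
via_table = pd.read_table(io.StringIO(csv_text), sep=',')

# read_csv uses ',' as its default separator.
via_csv = pd.read_csv(io.StringIO(csv_text))

# Both calls should produce the same table.
print(via_table.equals(via_csv))
```

Either call works for a CSV file; read_csv simply saves typing the separator.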

 

 

 

2. Python Basics for Data Analysis

 

(1) Variable : a container that holds data

(2) List : an ordered collection of data with an index (position, starting at 0)

(3) Dictionary : a collection of data made of key-value pairs

# declare variables
x, y, z = 5, 3, "hello"

# use variables
print(x + y)   # prints 8
print(z)       # prints hello


# declare lists with square brackets []
a_list = [1, 2, 3]
b_list = [1, 2, 'hello', 4]
list_ex = [3, 4, [5, 6], 7]

# access list elements
a_list[1]       # 2
b_list[2]       # 'hello'
list_ex[2]      # [5, 6]
list_ex[2][0]   # 5


# declare dictionaries with curly braces {}
student_age = {'Jack': 32, 'Mark': 22, 'John': 25}
dic_exercise = {'name': 'bob', 'age': 21, 'height': 180}

# access dictionary values (lookup uses square brackets, not curly braces)
student_age['John']      # 25
dic_exercise['height']   # 180
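
A few more operations on lists and dictionaries that tend to come up in data analysis are sketched below. The names 'Anna' and 'Tom' and the scores list are hypothetical additions, not part of the notes above.

```python
# A list keeps order, so items are reached by position (starting at 0).
scores = [90, 75, 60]
first_score = scores[0]          # first element
last_score = scores[-1]          # negative indexes count from the end

# A dictionary is reached by key, not by position.
student_age = {'Jack': 32, 'Mark': 22, 'John': 25}
student_age['Anna'] = 28         # adds a new key-value pair

# .get() returns a default instead of raising KeyError for a missing key.
unknown_age = student_age.get('Tom', None)

# Loop over key-value pairs with .items().
for name, age in student_age.items():
    print(name, age)
```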

 

 

 

3. Visualization Basics for Data Analysis

 

(1) Basic setup & analysis

 

  • dropna() : remove rows that contain missing values
  • head() : show the first few rows of the table
  • corr : correlation analysis
  • DataFrame.corr(method='A') : compute the correlation coefficients of the table using method A
import pandas as pd                                       # import pandas
titanic = pd.read_table('/content/train.csv', sep=',')    # load the titanic table
titanic = titanic.dropna()                                # remove rows with missing values (null)
titanic.head()                                            # show the first few rows of the table
corr = titanic.corr(method='pearson', numeric_only=True)  # Pearson correlation (numeric_only skips text columns; pandas 1.5+)
corr = corr[corr.Survived != 1]                           # keep only rows whose Survived coefficient is not 1 (the maximum)
corr                                                      # display
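
The same steps can be tried on a tiny table to see what each one does. This is only a sketch: the column names mirror the Titanic table, but the values and the 'Name' column contents are invented, and numeric_only=True assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 1, None],
    'Pclass':   [1, 3, 1, 3, 2, 2],
    'Fare':     [80.0, 7.0, 60.0, 8.0, 30.0, 20.0],
    'Name':     ['a', 'b', 'c', 'd', 'e', 'f'],  # non-numeric column
})

toy = toy.dropna()                                    # drops the row with a missing Survived
corr = toy.corr(method='pearson', numeric_only=True)  # skips the 'Name' column
corr = corr[corr.Survived != 1]                       # removes the Survived-vs-Survived row
print(corr['Survived'])
```

In this toy table the Pclass coefficient comes out negative and the Fare coefficient positive, which is the kind of pattern the bar chart later makes visible.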

 

 

(2) Visualizing the analysis results

 

  • DataFrame.drop([labels to delete], axis='index' or 'columns') : delete rows/columns
  • plot() : draw the data as a graph
  • plot.bar() : draw the data as a bar graph
import matplotlib.pyplot as plt                   # import matplotlib
corr = corr.drop(['PassengerId'], axis='index')   # remove the PassengerId row
corr['Survived'].plot.bar()                       # select the Survived column and show it as a bar graph
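
To make the axis argument of drop concrete, here is a minimal sketch on a hypothetical two-row table (the numbers are placeholders, not real correlation results).

```python
import pandas as pd

# Hypothetical small table, used only to show the two drop directions.
table = pd.DataFrame(
    {'Survived': [1.0, -0.3], 'Fare': [0.2, 0.1]},
    index=['PassengerId', 'Pclass'],
)

without_row = table.drop(['PassengerId'], axis='index')   # removes a row label
without_col = table.drop(['Fare'], axis='columns')        # removes a column label

print(without_row)
print(without_col)
```

drop returns a new table and leaves the original unchanged, which is why the notes assign the result back to corr.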

 

 

(3) Final code

import pandas as pd                               # import the pandas library
import matplotlib.pyplot as plt                   # import the matplotlib library
titanic = pd.read_table('/train.csv', sep=',')    # load the titanic table

# 1. count the null (blank) values in each column
print(titanic.isnull().sum())

# 2. remove rows with blank values
titanic = titanic.dropna()

# compute the correlation coefficients (numeric_only skips text columns; pandas 1.5+)
corr = titanic.corr(method='pearson', numeric_only=True)

# exclude the row whose Survived coefficient is 1 (Survived itself)
corr = corr[corr.Survived != 1]

# delete the PassengerId row
corr = corr.drop(['PassengerId'], axis='index')

# draw a bar graph of the correlation with survival
corr['Survived'].plot.bar()

# rotate the x-axis labels by 45 degrees
plt.xticks(rotation=45)

 

Analysis results
The closer Sex is to 1 (= female), the higher the probability of survival
The lower the seat class number (the closer to 1st class), the higher the probability of survival
The more expensive the fare, the higher the probability of survival
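
One way to sanity-check a reading like "lower class number, higher survival" is to compare the survival rate per class directly. The sketch below uses invented values, so only the pattern matters, not the numbers.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration).
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of survivors in each class.
rate_by_class = toy.groupby('Pclass')['Survived'].mean()
print(rate_by_class)

# A negative coefficient means survival falls as the class number rises.
class_corr = toy['Pclass'].corr(toy['Survived'])
print(class_corr)
```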

 

 

 

4. Visualization for Data Analysis: One Step Further

 

(1) Additional libraries

 

  • numpy : helps with numerical computation on data.
  • seaborn : builds on matplotlib to make visualization easier.
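
As a small illustration of the kind of computation numpy helps with, the sketch below works on a hypothetical array of ages (the values are made up).

```python
import numpy as np

ages = np.array([2, 5, 14, 22, 35, 41, 63])  # hypothetical ages

print(ages.mean())          # arithmetic mean of the array
print(ages + 1)             # element-wise arithmetic, no explicit loop
print(ages[ages >= 18])     # boolean mask selects matching elements
```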

 

 

(2) Final code

 

  • hist(bins=number of bins, figsize=(width, height), grid=True/False) : create a histogram
  • cut(array, bins=[], labels=[]) : split data into intervals
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

titanic = pd.read_table('/train.csv', sep=',')  # load the titanic table
titanic = titanic.dropna()                      # remove null values
titanic.head()
titanic.describe()                              # summary statistics of the data

# first graph: histogram of ages
titanic['Age'].hist(bins=40, figsize=(18, 8), grid=True)

# split ages into groups so the survival rate of each group can be checked
titanic['Age_cat'] = pd.cut(titanic['Age'], bins=[0, 3, 7, 15, 30, 60, 100], include_lowest=True, labels=['baby', 'children', 'teenage', 'young', 'adult', 'old'])

# compute the mean of each numeric column per age group (numeric_only skips text columns)
titanic.groupby('Age_cat').mean(numeric_only=True)

# set the figure size
plt.figure(figsize=(14, 5))

# draw a bar graph (x-axis = Age_cat, y-axis = Survived)
sns.barplot(x='Age_cat', y='Survived', data=titanic)

# show the second graph
plt.show()
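
To see how pd.cut assigns the labels used above, the sketch below applies the same bins to a short, hypothetical list of ages. It assumes the default right-inclusive edges, so each bin has the form (left, right].

```python
import pandas as pd

ages = pd.Series([0, 2, 5, 10, 20, 45, 70])   # hypothetical ages

# Bin edges are right-inclusive by default: (0, 3], (3, 7], (7, 15], ...
# include_lowest=True also puts the left edge 0 into the first bin.
age_cat = pd.cut(
    ages,
    bins=[0, 3, 7, 15, 30, 60, 100],
    include_lowest=True,
    labels=['baby', 'children', 'teenage', 'young', 'adult', 'old'],
)
print(age_cat.tolist())
```

Without include_lowest=True, an age of exactly 0 would fall outside the first bin and become a missing value.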