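In data analysis, the data to be analyzed is often provided as a CSV file and has to be read in first.
Here we look at three ways of reading such a file.
As the example we use the well-known IRIS dataset, a standard dataset in machine learning.
IRIS refers to the iris flower (ayame in Japanese); the dataset is distributed by UCI (University of California, Irvine) as test data for machine learning and data mining.
The dataset covers three iris species: setosa, versicolor, and virginica.
Each sample is described by four measurements: sepal length, sepal width, petal length, and petal width.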
# Read iris.csv line by line with the built-in file API
irisData = []
with open('iris.csv', 'r') as fp:
    next(fp)  # skip the header row
    for line in fp:
        record = line.strip().split(',')  # CSV fields are separated by commas, not periods
        irisData.append(record)
irisData
[['5.1', '3.5', '1.4', '0.2', 'setosa'], ['4.9', '3.0', '1.4', '0.2', 'setosa'], ['4.7', '3.2', '1.3', '0.2', 'setosa'], ..., ['5.9', '3.0', '5.1', '1.8', 'virginica']]
(150 records in total; the middle records are omitted here.)
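As shown above, the contents of the IRIS file have been read in.
From left to right, each record holds the sepal length, sepal width, petal length, petal width, and species.
Each row is one measured iris flower. Check how many samples were measured for each species,
and how many samples there are in total.
(This is quick to check in Excel or a similar tool.)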
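The counts asked for above can also be checked in Python instead of Excel. The sketch below is a minimal example that uses a small inline sample (a hypothetical stand-in for iris.csv with the same column layout) so that it runs without the file; count_by_species is a helper name introduced here for illustration only.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for iris.csv: same header and column order,
# but only a handful of rows so the example is self-contained.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""


def count_by_species(fp):
    """Return (per-species counts, total count) for an open CSV file object."""
    reader = csv.reader(fp)
    next(reader)  # skip the header row
    # The species name is the last field of every non-empty row
    counts = Counter(row[-1] for row in reader if row)
    return counts, sum(counts.values())


counts, total = count_by_species(io.StringIO(SAMPLE_CSV))
print(counts)
print(total)
```

With the real file, an object returned by open('iris.csv', newline='') can be passed in place of the io.StringIO object.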
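The data distributed for this exercise contained decimal values.
For each of the features above, these values are the data normalized to 1.
Normalizing the data yourself should give the same values, so please check this.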
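The text does not spell out which normalization was used. As one possible reading, the sketch below scales each feature so that its largest value becomes 1, by dividing every value by the column maximum. This choice and the helper name scale_to_max are assumptions made for illustration; the three rows are the first three measurements of the dataset.

```python
def scale_to_max(rows):
    """Divide every column by its maximum so that each feature peaks at 1."""
    columns = list(zip(*rows))               # transpose: rows -> columns
    maxima = [max(col) for col in columns]   # largest value of each feature
    return [[value / m for value, m in zip(row, maxima)] for row in rows]


# First three measurements: sepal length, sepal width, petal length, petal width
rows = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
]

scaled = scale_to_max(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

If the distributed values were produced by a different scheme (for example min-max scaling), only the body of scale_to_max would need to change.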
import pandas as pd
irisData = pd.read_csv('iris.csv')
irisData
| | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
12 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
15 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
16 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
17 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
18 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
19 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
20 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
21 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
22 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
23 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
24 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
25 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
26 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
27 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
28 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
29 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
120 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |
121 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
123 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
124 | 6.7 | 3.3 | 5.7 | 2.1 | virginica |
125 | 7.2 | 3.2 | 6.0 | 1.8 | virginica |
126 | 6.2 | 2.8 | 4.8 | 1.8 | virginica |
127 | 6.1 | 3.0 | 4.9 | 1.8 | virginica |
128 | 6.4 | 2.8 | 5.6 | 2.1 | virginica |
129 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
130 | 7.4 | 2.8 | 6.1 | 1.9 | virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
132 | 6.4 | 2.8 | 5.6 | 2.2 | virginica |
133 | 6.3 | 2.8 | 5.1 | 1.5 | virginica |
134 | 6.1 | 2.6 | 5.6 | 1.4 | virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
136 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |
137 | 6.4 | 3.1 | 5.5 | 1.8 | virginica |
138 | 6.0 | 3.0 | 4.8 | 1.8 | virginica |
139 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
140 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
141 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
142 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
143 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
144 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
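Looking at the first row, you can see the name of each feature.
In this way pandas can present the data in tabular form.
Next, let's apply ".head" to the irisData object.
Note that the result is printed as plain text (stdout style) rather than as a table. Because head is written without parentheses, the method is not called; what is displayed is the method object itself, whose representation includes the whole DataFrame.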
irisData.head
<bound method NDFrame.head of sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa 5 5.4 3.9 1.7 0.4 setosa 6 4.6 3.4 1.4 0.3 setosa 7 5.0 3.4 1.5 0.2 setosa 8 4.4 2.9 1.4 0.2 setosa 9 4.9 3.1 1.5 0.1 setosa 10 5.4 3.7 1.5 0.2 setosa 11 4.8 3.4 1.6 0.2 setosa 12 4.8 3.0 1.4 0.1 setosa 13 4.3 3.0 1.1 0.1 setosa 14 5.8 4.0 1.2 0.2 setosa 15 5.7 4.4 1.5 0.4 setosa 16 5.4 3.9 1.3 0.4 setosa 17 5.1 3.5 1.4 0.3 setosa 18 5.7 3.8 1.7 0.3 setosa 19 5.1 3.8 1.5 0.3 setosa 20 5.4 3.4 1.7 0.2 setosa 21 5.1 3.7 1.5 0.4 setosa 22 4.6 3.6 1.0 0.2 setosa 23 5.1 3.3 1.7 0.5 setosa 24 4.8 3.4 1.9 0.2 setosa 25 5.0 3.0 1.6 0.2 setosa 26 5.0 3.4 1.6 0.4 setosa 27 5.2 3.5 1.5 0.2 setosa 28 5.2 3.4 1.4 0.2 setosa 29 4.7 3.2 1.6 0.2 setosa .. ... ... ... ... ... 120 6.9 3.2 5.7 2.3 virginica 121 5.6 2.8 4.9 2.0 virginica 122 7.7 2.8 6.7 2.0 virginica 123 6.3 2.7 4.9 1.8 virginica 124 6.7 3.3 5.7 2.1 virginica 125 7.2 3.2 6.0 1.8 virginica 126 6.2 2.8 4.8 1.8 virginica 127 6.1 3.0 4.9 1.8 virginica 128 6.4 2.8 5.6 2.1 virginica 129 7.2 3.0 5.8 1.6 virginica 130 7.4 2.8 6.1 1.9 virginica 131 7.9 3.8 6.4 2.0 virginica 132 6.4 2.8 5.6 2.2 virginica 133 6.3 2.8 5.1 1.5 virginica 134 6.1 2.6 5.6 1.4 virginica 135 7.7 3.0 6.1 2.3 virginica 136 6.3 3.4 5.6 2.4 virginica 137 6.4 3.1 5.5 1.8 virginica 138 6.0 3.0 4.8 1.8 virginica 139 6.9 3.1 5.4 2.1 virginica 140 6.7 3.1 5.6 2.4 virginica 141 6.9 3.1 5.1 2.3 virginica 142 5.8 2.7 5.1 1.9 virginica 143 6.8 3.2 5.9 2.3 virginica 144 6.7 3.3 5.7 2.5 virginica 145 6.7 3.0 5.2 2.3 virginica 146 6.3 2.5 5.0 1.9 virginica 147 6.5 3.0 5.2 2.0 virginica 148 6.2 3.4 5.4 2.3 virginica 149 5.9 3.0 5.1 1.8 virginica [150 rows x 5 columns]>
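Writing irisData.head without parentheses only refers to the method object, which is why the output above starts with "<bound method ..." and dumps the whole frame as text. As a small sketch of the difference, the example below builds a five-row DataFrame inline (a stand-in for the loaded file, using the first five rows of the table) and actually calls the method.

```python
import pandas as pd

# Small stand-in for the loaded iris data (first five rows of the table above)
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal_width': [0.2, 0.2, 0.2, 0.2, 0.2],
    'species': ['setosa'] * 5,
})

method = df.head      # no parentheses: just a reference to the bound method
first3 = df.head(3)   # with parentheses: returns a new DataFrame with 3 rows

print(callable(method))
print(first3)
```

Called without an argument, head() returns the first five rows.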
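As we have seen, pandas makes it easy to read a CSV file.
pandas.read_csv can detect the type of each column automatically and convert the values accordingly.
NumPy can also read data easily with numpy.loadtxt, but because the result is stored in an ndarray, all columns must have the same type.
The columns of a CSV file do not necessarily share one type, so pandas.read_csv() is usually the more convenient choice.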
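To make the difference concrete, the following sketch reads the same small inline excerpt (a hypothetical stand-in for iris.csv) with both functions. It assumes that restricting numpy.loadtxt to the four numeric columns via usecols is acceptable; the species column cannot be converted to float.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row excerpt with the same layout as iris.csv
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# pandas infers a type per column: float64 for the measurements,
# a string-like type for species
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df.dtypes)

# numpy.loadtxt produces a single dtype, so skip the header row and
# read only the four numeric columns
arr = np.loadtxt(io.StringIO(SAMPLE_CSV), delimiter=',', skiprows=1,
                 usecols=(0, 1, 2, 3))
print(arr.shape, arr.dtype)
```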
Still, NumPy is a very handy library, and there are times when you want to work with an ndarray anyway.
In that case, convert the data as follows.
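import numpy as np
# Collect the four numeric columns; each column becomes one row of the array
cust_array = np.array([irisData['sepal_length'].tolist(),
                       irisData['sepal_width'].tolist(),
                       irisData['petal_length'].tolist(),
                       irisData['petal_width'].tolist(),
                       ], float)  # np.float was removed in NumPy 1.24; the built-in float is equivalent
# Transpose so that each row is one sample
cust_array = cust_array.T
cust_array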
array([[ 5.1, 3.5, 1.4, 0.2], [ 4.9, 3. , 1.4, 0.2], [ 4.7, 3.2, 1.3, 0.2], [ 4.6, 3.1, 1.5, 0.2], [ 5. , 3.6, 1.4, 0.2], [ 5.4, 3.9, 1.7, 0.4], [ 4.6, 3.4, 1.4, 0.3], [ 5. , 3.4, 1.5, 0.2], [ 4.4, 2.9, 1.4, 0.2], [ 4.9, 3.1, 1.5, 0.1], [ 5.4, 3.7, 1.5, 0.2], [ 4.8, 3.4, 1.6, 0.2], [ 4.8, 3. , 1.4, 0.1], [ 4.3, 3. , 1.1, 0.1], [ 5.8, 4. , 1.2, 0.2], [ 5.7, 4.4, 1.5, 0.4], [ 5.4, 3.9, 1.3, 0.4], [ 5.1, 3.5, 1.4, 0.3], [ 5.7, 3.8, 1.7, 0.3], [ 5.1, 3.8, 1.5, 0.3], [ 5.4, 3.4, 1.7, 0.2], [ 5.1, 3.7, 1.5, 0.4], [ 4.6, 3.6, 1. , 0.2], [ 5.1, 3.3, 1.7, 0.5], [ 4.8, 3.4, 1.9, 0.2], [ 5. , 3. , 1.6, 0.2], [ 5. , 3.4, 1.6, 0.4], [ 5.2, 3.5, 1.5, 0.2], [ 5.2, 3.4, 1.4, 0.2], [ 4.7, 3.2, 1.6, 0.2], [ 4.8, 3.1, 1.6, 0.2], [ 5.4, 3.4, 1.5, 0.4], [ 5.2, 4.1, 1.5, 0.1], [ 5.5, 4.2, 1.4, 0.2], [ 4.9, 3.1, 1.5, 0.2], [ 5. , 3.2, 1.2, 0.2], [ 5.5, 3.5, 1.3, 0.2], [ 4.9, 3.6, 1.4, 0.1], [ 4.4, 3. , 1.3, 0.2], [ 5.1, 3.4, 1.5, 0.2], [ 5. , 3.5, 1.3, 0.3], [ 4.5, 2.3, 1.3, 0.3], [ 4.4, 3.2, 1.3, 0.2], [ 5. , 3.5, 1.6, 0.6], [ 5.1, 3.8, 1.9, 0.4], [ 4.8, 3. , 1.4, 0.3], [ 5.1, 3.8, 1.6, 0.2], [ 4.6, 3.2, 1.4, 0.2], [ 5.3, 3.7, 1.5, 0.2], [ 5. , 3.3, 1.4, 0.2], [ 7. , 3.2, 4.7, 1.4], [ 6.4, 3.2, 4.5, 1.5], [ 6.9, 3.1, 4.9, 1.5], [ 5.5, 2.3, 4. , 1.3], [ 6.5, 2.8, 4.6, 1.5], [ 5.7, 2.8, 4.5, 1.3], [ 6.3, 3.3, 4.7, 1.6], [ 4.9, 2.4, 3.3, 1. ], [ 6.6, 2.9, 4.6, 1.3], [ 5.2, 2.7, 3.9, 1.4], [ 5. , 2. , 3.5, 1. ], [ 5.9, 3. , 4.2, 1.5], [ 6. , 2.2, 4. , 1. ], [ 6.1, 2.9, 4.7, 1.4], [ 5.6, 2.9, 3.6, 1.3], [ 6.7, 3.1, 4.4, 1.4], [ 5.6, 3. , 4.5, 1.5], [ 5.8, 2.7, 4.1, 1. ], [ 6.2, 2.2, 4.5, 1.5], [ 5.6, 2.5, 3.9, 1.1], [ 5.9, 3.2, 4.8, 1.8], [ 6.1, 2.8, 4. , 1.3], [ 6.3, 2.5, 4.9, 1.5], [ 6.1, 2.8, 4.7, 1.2], [ 6.4, 2.9, 4.3, 1.3], [ 6.6, 3. , 4.4, 1.4], [ 6.8, 2.8, 4.8, 1.4], [ 6.7, 3. , 5. , 1.7], [ 6. , 2.9, 4.5, 1.5], [ 5.7, 2.6, 3.5, 1. ], [ 5.5, 2.4, 3.8, 1.1], [ 5.5, 2.4, 3.7, 1. ], [ 5.8, 2.7, 3.9, 1.2], [ 6. , 2.7, 5.1, 1.6], [ 5.4, 3. , 4.5, 1.5], [ 6. 
, 3.4, 4.5, 1.6], [ 6.7, 3.1, 4.7, 1.5], [ 6.3, 2.3, 4.4, 1.3], [ 5.6, 3. , 4.1, 1.3], [ 5.5, 2.5, 4. , 1.3], [ 5.5, 2.6, 4.4, 1.2], [ 6.1, 3. , 4.6, 1.4], [ 5.8, 2.6, 4. , 1.2], [ 5. , 2.3, 3.3, 1. ], [ 5.6, 2.7, 4.2, 1.3], [ 5.7, 3. , 4.2, 1.2], [ 5.7, 2.9, 4.2, 1.3], [ 6.2, 2.9, 4.3, 1.3], [ 5.1, 2.5, 3. , 1.1], [ 5.7, 2.8, 4.1, 1.3], [ 6.3, 3.3, 6. , 2.5], [ 5.8, 2.7, 5.1, 1.9], [ 7.1, 3. , 5.9, 2.1], [ 6.3, 2.9, 5.6, 1.8], [ 6.5, 3. , 5.8, 2.2], [ 7.6, 3. , 6.6, 2.1], [ 4.9, 2.5, 4.5, 1.7], [ 7.3, 2.9, 6.3, 1.8], [ 6.7, 2.5, 5.8, 1.8], [ 7.2, 3.6, 6.1, 2.5], [ 6.5, 3.2, 5.1, 2. ], [ 6.4, 2.7, 5.3, 1.9], [ 6.8, 3. , 5.5, 2.1], [ 5.7, 2.5, 5. , 2. ], [ 5.8, 2.8, 5.1, 2.4], [ 6.4, 3.2, 5.3, 2.3], [ 6.5, 3. , 5.5, 1.8], [ 7.7, 3.8, 6.7, 2.2], [ 7.7, 2.6, 6.9, 2.3], [ 6. , 2.2, 5. , 1.5], [ 6.9, 3.2, 5.7, 2.3], [ 5.6, 2.8, 4.9, 2. ], [ 7.7, 2.8, 6.7, 2. ], [ 6.3, 2.7, 4.9, 1.8], [ 6.7, 3.3, 5.7, 2.1], [ 7.2, 3.2, 6. , 1.8], [ 6.2, 2.8, 4.8, 1.8], [ 6.1, 3. , 4.9, 1.8], [ 6.4, 2.8, 5.6, 2.1], [ 7.2, 3. , 5.8, 1.6], [ 7.4, 2.8, 6.1, 1.9], [ 7.9, 3.8, 6.4, 2. ], [ 6.4, 2.8, 5.6, 2.2], [ 6.3, 2.8, 5.1, 1.5], [ 6.1, 2.6, 5.6, 1.4], [ 7.7, 3. , 6.1, 2.3], [ 6.3, 3.4, 5.6, 2.4], [ 6.4, 3.1, 5.5, 1.8], [ 6. , 3. , 4.8, 1.8], [ 6.9, 3.1, 5.4, 2.1], [ 6.7, 3.1, 5.6, 2.4], [ 6.9, 3.1, 5.1, 2.3], [ 5.8, 2.7, 5.1, 1.9], [ 6.8, 3.2, 5.9, 2.3], [ 6.7, 3.3, 5.7, 2.5], [ 6.7, 3. , 5.2, 2.3], [ 6.3, 2.5, 5. , 1.9], [ 6.5, 3. , 5.2, 2. ], [ 6.2, 3.4, 5.4, 2.3], [ 5.9, 3. , 5.1, 1.8]])
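The same array can also be obtained more directly by selecting the numeric columns and converting them in one step. The sketch below assumes a pandas version that provides DataFrame.to_numpy() (0.24 or later) and uses a three-row stand-in for irisData; the variable names feature_cols and features are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for irisData: three rows with the same columns
df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'sepal_width': [3.5, 3.2, 3.3],
    'petal_length': [1.4, 4.7, 6.0],
    'petal_width': [0.2, 1.4, 2.5],
    'species': ['setosa', 'versicolor', 'virginica'],
})

feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Select only the numeric columns, then convert; no transpose is needed
# because each DataFrame row already corresponds to one sample
features = df[feature_cols].to_numpy(dtype=float)
print(features.shape)
print(features[0])
```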
To read the file directly into NumPy, the following function is more convenient.
np.genfromtxt
cust2array = np.genfromtxt('iris.csv', delimiter=',', dtype=None, names=('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'))
cust2array
array([ (b'sepal_length', b'sepal_width', b'petal_length', b'petal_width', b'species'), (b'5.1', b'3.5', b'1.4', b'0.2', b'setosa'), (b'4.9', b'3.0', b'1.4', b'0.2', b'setosa'), (b'4.7', b'3.2', b'1.3', b'0.2', b'setosa'), (b'4.6', b'3.1', b'1.5', b'0.2', b'setosa'), (b'5.0', b'3.6', b'1.4', b'0.2', b'setosa'), (b'5.4', b'3.9', b'1.7', b'0.4', b'setosa'), (b'4.6', b'3.4', b'1.4', b'0.3', b'setosa'), (b'5.0', b'3.4', b'1.5', b'0.2', b'setosa'), (b'4.4', b'2.9', b'1.4', b'0.2', b'setosa'), (b'4.9', b'3.1', b'1.5', b'0.1', b'setosa'), (b'5.4', b'3.7', b'1.5', b'0.2', b'setosa'), (b'4.8', b'3.4', b'1.6', b'0.2', b'setosa'), (b'4.8', b'3.0', b'1.4', b'0.1', b'setosa'), (b'4.3', b'3.0', b'1.1', b'0.1', b'setosa'), (b'5.8', b'4.0', b'1.2', b'0.2', b'setosa'), (b'5.7', b'4.4', b'1.5', b'0.4', b'setosa'), (b'5.4', b'3.9', b'1.3', b'0.4', b'setosa'), (b'5.1', b'3.5', b'1.4', b'0.3', b'setosa'), (b'5.7', b'3.8', b'1.7', b'0.3', b'setosa'), (b'5.1', b'3.8', b'1.5', b'0.3', b'setosa'), (b'5.4', b'3.4', b'1.7', b'0.2', b'setosa'), (b'5.1', b'3.7', b'1.5', b'0.4', b'setosa'), (b'4.6', b'3.6', b'1.0', b'0.2', b'setosa'), (b'5.1', b'3.3', b'1.7', b'0.5', b'setosa'), (b'4.8', b'3.4', b'1.9', b'0.2', b'setosa'), (b'5.0', b'3.0', b'1.6', b'0.2', b'setosa'), (b'5.0', b'3.4', b'1.6', b'0.4', b'setosa'), (b'5.2', b'3.5', b'1.5', b'0.2', b'setosa'), (b'5.2', b'3.4', b'1.4', b'0.2', b'setosa'), (b'4.7', b'3.2', b'1.6', b'0.2', b'setosa'), (b'4.8', b'3.1', b'1.6', b'0.2', b'setosa'), (b'5.4', b'3.4', b'1.5', b'0.4', b'setosa'), (b'5.2', b'4.1', b'1.5', b'0.1', b'setosa'), (b'5.5', b'4.2', b'1.4', b'0.2', b'setosa'), (b'4.9', b'3.1', b'1.5', b'0.2', b'setosa'), (b'5.0', b'3.2', b'1.2', b'0.2', b'setosa'), (b'5.5', b'3.5', b'1.3', b'0.2', b'setosa'), (b'4.9', b'3.6', b'1.4', b'0.1', b'setosa'), (b'4.4', b'3.0', b'1.3', b'0.2', b'setosa'), (b'5.1', b'3.4', b'1.5', b'0.2', b'setosa'), (b'5.0', b'3.5', b'1.3', b'0.3', b'setosa'), (b'4.5', b'2.3', b'1.3', b'0.3', b'setosa'), (b'4.4', b'3.2', 
b'1.3', b'0.2', b'setosa'), (b'5.0', b'3.5', b'1.6', b'0.6', b'setosa'), (b'5.1', b'3.8', b'1.9', b'0.4', b'setosa'), (b'4.8', b'3.0', b'1.4', b'0.3', b'setosa'), (b'5.1', b'3.8', b'1.6', b'0.2', b'setosa'), (b'4.6', b'3.2', b'1.4', b'0.2', b'setosa'), (b'5.3', b'3.7', b'1.5', b'0.2', b'setosa'), (b'5.0', b'3.3', b'1.4', b'0.2', b'setosa'), (b'7.0', b'3.2', b'4.7', b'1.4', b'versicolor'), (b'6.4', b'3.2', b'4.5', b'1.5', b'versicolor'), (b'6.9', b'3.1', b'4.9', b'1.5', b'versicolor'), (b'5.5', b'2.3', b'4.0', b'1.3', b'versicolor'), (b'6.5', b'2.8', b'4.6', b'1.5', b'versicolor'), (b'5.7', b'2.8', b'4.5', b'1.3', b'versicolor'), (b'6.3', b'3.3', b'4.7', b'1.6', b'versicolor'), (b'4.9', b'2.4', b'3.3', b'1.0', b'versicolor'), (b'6.6', b'2.9', b'4.6', b'1.3', b'versicolor'), (b'5.2', b'2.7', b'3.9', b'1.4', b'versicolor'), (b'5.0', b'2.0', b'3.5', b'1.0', b'versicolor'), (b'5.9', b'3.0', b'4.2', b'1.5', b'versicolor'), (b'6.0', b'2.2', b'4.0', b'1.0', b'versicolor'), (b'6.1', b'2.9', b'4.7', b'1.4', b'versicolor'), (b'5.6', b'2.9', b'3.6', b'1.3', b'versicolor'), (b'6.7', b'3.1', b'4.4', b'1.4', b'versicolor'), (b'5.6', b'3.0', b'4.5', b'1.5', b'versicolor'), (b'5.8', b'2.7', b'4.1', b'1.0', b'versicolor'), (b'6.2', b'2.2', b'4.5', b'1.5', b'versicolor'), (b'5.6', b'2.5', b'3.9', b'1.1', b'versicolor'), (b'5.9', b'3.2', b'4.8', b'1.8', b'versicolor'), (b'6.1', b'2.8', b'4.0', b'1.3', b'versicolor'), (b'6.3', b'2.5', b'4.9', b'1.5', b'versicolor'), (b'6.1', b'2.8', b'4.7', b'1.2', b'versicolor'), (b'6.4', b'2.9', b'4.3', b'1.3', b'versicolor'), (b'6.6', b'3.0', b'4.4', b'1.4', b'versicolor'), (b'6.8', b'2.8', b'4.8', b'1.4', b'versicolor'), (b'6.7', b'3.0', b'5.0', b'1.7', b'versicolor'), (b'6.0', b'2.9', b'4.5', b'1.5', b'versicolor'), (b'5.7', b'2.6', b'3.5', b'1.0', b'versicolor'), (b'5.5', b'2.4', b'3.8', b'1.1', b'versicolor'), (b'5.5', b'2.4', b'3.7', b'1.0', b'versicolor'), (b'5.8', b'2.7', b'3.9', b'1.2', b'versicolor'), (b'6.0', b'2.7', b'5.1', b'1.6', 
b'versicolor'), (b'5.4', b'3.0', b'4.5', b'1.5', b'versicolor'), (b'6.0', b'3.4', b'4.5', b'1.6', b'versicolor'), (b'6.7', b'3.1', b'4.7', b'1.5', b'versicolor'), (b'6.3', b'2.3', b'4.4', b'1.3', b'versicolor'), (b'5.6', b'3.0', b'4.1', b'1.3', b'versicolor'), (b'5.5', b'2.5', b'4.0', b'1.3', b'versicolor'), (b'5.5', b'2.6', b'4.4', b'1.2', b'versicolor'), (b'6.1', b'3.0', b'4.6', b'1.4', b'versicolor'), (b'5.8', b'2.6', b'4.0', b'1.2', b'versicolor'), (b'5.0', b'2.3', b'3.3', b'1.0', b'versicolor'), (b'5.6', b'2.7', b'4.2', b'1.3', b'versicolor'), (b'5.7', b'3.0', b'4.2', b'1.2', b'versicolor'), (b'5.7', b'2.9', b'4.2', b'1.3', b'versicolor'), (b'6.2', b'2.9', b'4.3', b'1.3', b'versicolor'), (b'5.1', b'2.5', b'3.0', b'1.1', b'versicolor'), (b'5.7', b'2.8', b'4.1', b'1.3', b'versicolor'), (b'6.3', b'3.3', b'6.0', b'2.5', b'virginica'), (b'5.8', b'2.7', b'5.1', b'1.9', b'virginica'), (b'7.1', b'3.0', b'5.9', b'2.1', b'virginica'), (b'6.3', b'2.9', b'5.6', b'1.8', b'virginica'), (b'6.5', b'3.0', b'5.8', b'2.2', b'virginica'), (b'7.6', b'3.0', b'6.6', b'2.1', b'virginica'), (b'4.9', b'2.5', b'4.5', b'1.7', b'virginica'), (b'7.3', b'2.9', b'6.3', b'1.8', b'virginica'), (b'6.7', b'2.5', b'5.8', b'1.8', b'virginica'), (b'7.2', b'3.6', b'6.1', b'2.5', b'virginica'), (b'6.5', b'3.2', b'5.1', b'2.0', b'virginica'), (b'6.4', b'2.7', b'5.3', b'1.9', b'virginica'), (b'6.8', b'3.0', b'5.5', b'2.1', b'virginica'), (b'5.7', b'2.5', b'5.0', b'2.0', b'virginica'), (b'5.8', b'2.8', b'5.1', b'2.4', b'virginica'), (b'6.4', b'3.2', b'5.3', b'2.3', b'virginica'), (b'6.5', b'3.0', b'5.5', b'1.8', b'virginica'), (b'7.7', b'3.8', b'6.7', b'2.2', b'virginica'), (b'7.7', b'2.6', b'6.9', b'2.3', b'virginica'), (b'6.0', b'2.2', b'5.0', b'1.5', b'virginica'), (b'6.9', b'3.2', b'5.7', b'2.3', b'virginica'), (b'5.6', b'2.8', b'4.9', b'2.0', b'virginica'), (b'7.7', b'2.8', b'6.7', b'2.0', b'virginica'), (b'6.3', b'2.7', b'4.9', b'1.8', b'virginica'), (b'6.7', b'3.3', b'5.7', b'2.1', b'virginica'), 
(b'7.2', b'3.2', b'6.0', b'1.8', b'virginica'), (b'6.2', b'2.8', b'4.8', b'1.8', b'virginica'), (b'6.1', b'3.0', b'4.9', b'1.8', b'virginica'), (b'6.4', b'2.8', b'5.6', b'2.1', b'virginica'), (b'7.2', b'3.0', b'5.8', b'1.6', b'virginica'), (b'7.4', b'2.8', b'6.1', b'1.9', b'virginica'), (b'7.9', b'3.8', b'6.4', b'2.0', b'virginica'), (b'6.4', b'2.8', b'5.6', b'2.2', b'virginica'), (b'6.3', b'2.8', b'5.1', b'1.5', b'virginica'), (b'6.1', b'2.6', b'5.6', b'1.4', b'virginica'), (b'7.7', b'3.0', b'6.1', b'2.3', b'virginica'), (b'6.3', b'3.4', b'5.6', b'2.4', b'virginica'), (b'6.4', b'3.1', b'5.5', b'1.8', b'virginica'), (b'6.0', b'3.0', b'4.8', b'1.8', b'virginica'), (b'6.9', b'3.1', b'5.4', b'2.1', b'virginica'), (b'6.7', b'3.1', b'5.6', b'2.4', b'virginica'), (b'6.9', b'3.1', b'5.1', b'2.3', b'virginica'), (b'5.8', b'2.7', b'5.1', b'1.9', b'virginica'), (b'6.8', b'3.2', b'5.9', b'2.3', b'virginica'), (b'6.7', b'3.3', b'5.7', b'2.5', b'virginica'), (b'6.7', b'3.0', b'5.2', b'2.3', b'virginica'), (b'6.3', b'2.5', b'5.0', b'1.9', b'virginica'), (b'6.5', b'3.0', b'5.2', b'2.0', b'virginica'), (b'6.2', b'3.4', b'5.4', b'2.3', b'virginica'), (b'5.9', b'3.0', b'5.1', b'1.8', b'virginica')], dtype=[('sepal_length', 'S12'), ('sepal_width', 'S11'), ('petal_length', 'S12'), ('petal_width', 'S11'), ('species', 'S10')])
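In the output above the header line was read as a data row, so every field ended up as a byte string. A possible alternative, sketched below on an inline three-row excerpt (a hypothetical stand-in for iris.csv), is to pass names=True so that genfromtxt takes the field names from the first line and can infer a numeric type for each measurement column. Passing encoding='utf-8' is an assumption made here so that the species field is decoded to str.

```python
import io

import numpy as np

# Hypothetical three-row excerpt with the same layout as iris.csv
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# names=True: use the first line as field names instead of data
# dtype=None: infer the type of each column separately
data = np.genfromtxt(io.StringIO(SAMPLE_CSV), delimiter=',', dtype=None,
                     names=True, encoding='utf-8')

print(data.dtype.names)
print(data['sepal_length'])   # numeric field
print(data['species'])        # string field
```

The result is a structured array, so columns are accessed by field name rather than by a second index.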
# Check the number of rows
len(irisData)
150
# Check the shape (rows, columns)
irisData.shape
(150, 5)
# List the column information
irisData.info()  # column names and their types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
# Check the basic statistics of each column
irisData.describe()  # mean, standard deviation, quartiles, etc.
| | sepal_length | sepal_width | petal_length | petal_width |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
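The object that pandas returns when reading the file is an instance of DataFrame, which is very convenient to work with.
A DataFrame lets you view the basic statistics and inspect the various properties of the data.
The following call wraps irisData in pd.DataFrame explicitly; since pd.read_csv already returns a DataFrame, irisDF holds the same data.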
irisDF = pd.DataFrame(irisData)
irisDF
| | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
12 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
14 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
15 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
16 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
17 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
18 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
19 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
20 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
21 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
22 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
23 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
24 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
25 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
26 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
27 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
28 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
29 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
120 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |
121 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
123 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
124 | 6.7 | 3.3 | 5.7 | 2.1 | virginica |
125 | 7.2 | 3.2 | 6.0 | 1.8 | virginica |
126 | 6.2 | 2.8 | 4.8 | 1.8 | virginica |
127 | 6.1 | 3.0 | 4.9 | 1.8 | virginica |
128 | 6.4 | 2.8 | 5.6 | 2.1 | virginica |
129 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
130 | 7.4 | 2.8 | 6.1 | 1.9 | virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
132 | 6.4 | 2.8 | 5.6 | 2.2 | virginica |
133 | 6.3 | 2.8 | 5.1 | 1.5 | virginica |
134 | 6.1 | 2.6 | 5.6 | 1.4 | virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
136 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |
137 | 6.4 | 3.1 | 5.5 | 1.8 | virginica |
138 | 6.0 | 3.0 | 4.8 | 1.8 | virginica |
139 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
140 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
141 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
142 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
143 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
144 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
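The iris species contained in this data were setosa, versicolor, and virginica.
This means that the information can be examined separately for each of them.
Let's start by splitting the data by species.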
setosa = irisData[irisData['species']=='setosa']
versicolor = irisData[irisData['species']=='versicolor']
virginica = irisData[irisData['species']=='virginica']
Now let's look at the basic statistics of each species.
setosa.describe()
| | sepal_length | sepal_width | petal_length | petal_width |
---|---|---|---|---|
count | 50.00000 | 50.000000 | 50.000000 | 50.000000 |
mean | 5.00600 | 3.428000 | 1.462000 | 0.246000 |
std | 0.35249 | 0.379064 | 0.173664 | 0.105386 |
min | 4.30000 | 2.300000 | 1.000000 | 0.100000 |
25% | 4.80000 | 3.200000 | 1.400000 | 0.200000 |
50% | 5.00000 | 3.400000 | 1.500000 | 0.200000 |
75% | 5.20000 | 3.675000 | 1.575000 | 0.300000 |
max | 5.80000 | 4.400000 | 1.900000 | 0.600000 |
versicolor.describe()
| | sepal_length | sepal_width | petal_length | petal_width |
---|---|---|---|---|
count | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
mean | 5.936000 | 2.770000 | 4.260000 | 1.326000 |
std | 0.516171 | 0.313798 | 0.469911 | 0.197753 |
min | 4.900000 | 2.000000 | 3.000000 | 1.000000 |
25% | 5.600000 | 2.525000 | 4.000000 | 1.200000 |
50% | 5.900000 | 2.800000 | 4.350000 | 1.300000 |
75% | 6.300000 | 3.000000 | 4.600000 | 1.500000 |
max | 7.000000 | 3.400000 | 5.100000 | 1.800000 |
virginica.describe()
| | sepal_length | sepal_width | petal_length | petal_width |
---|---|---|---|---|
count | 50.00000 | 50.000000 | 50.000000 | 50.00000 |
mean | 6.58800 | 2.974000 | 5.552000 | 2.02600 |
std | 0.63588 | 0.322497 | 0.551895 | 0.27465 |
min | 4.90000 | 2.200000 | 4.500000 | 1.40000 |
25% | 6.22500 | 2.800000 | 5.100000 | 1.80000 |
50% | 6.50000 | 3.000000 | 5.550000 | 2.00000 |
75% | 6.90000 | 3.175000 | 5.875000 | 2.30000 |
max | 7.90000 | 3.800000 | 6.900000 | 2.50000 |
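Instead of creating one DataFrame per species by hand, the same per-species statistics can be computed in one pass with groupby. The sketch below is a minimal example on a six-row stand-in with two of the feature columns; with the full data, irisData would take the place of df.

```python
import pandas as pd

# Six-row stand-in: two samples per species, two of the four features
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor',
                'virginica', 'virginica'],
})

grouped = df.groupby('species')   # one group per species name

means = grouped.mean()            # per-species mean of every numeric column
print(means)

# describe() on a single column gives count, mean, std, quartiles per species
print(grouped['petal_length'].describe())
```

This keeps the species in a single table, which makes it easier to compare them side by side.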
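The other statistics can be obtained as follows, so please try them.
# Keep only the numeric columns; recent pandas versions raise an error
# when mean(), corr(), etc. meet the string column 'species'
setosa_num = setosa.drop(columns='species')
setosa_num.sum()     # sum
setosa_num.mean()    # mean
setosa_num.median()  # median
setosa_num.skew()    # skewness
setosa_num.kurt()    # kurtosis
setosa_num.min()     # minimum
setosa_num.max()     # maximum
setosa_num.corr()    # correlation coefficients
setosa_num.var()     # variance
setosa_num.std()     # standard deviation
setosa_num.cov()     # covariance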
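Scikit-learn is the standard Python library for machine learning.
It covers a wide range of functionality for training on given data and evaluating the results, and provides algorithms for regression, clustering, classification, dimensionality reduction, and more.
Well-known datasets such as IRIS are also bundled with it, so analyses using this sample data are easy to run.
Here we load the IRIS data that is already installed with the library.
If you are interested, try running the same processing on the data read with numpy or pandas as well.
If scikit-learn was installed with Anaconda, several sample datasets, including IRIS, can be found under Lib/site-packages/sklearn/datasets/data.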
import numpy as np
from sklearn import neighbors, datasets, metrics
# This single line loads the IRIS data
iris = datasets.load_iris()  # dataset with 4 dimensions and 150 samples:
# sepal length and width, petal length and width, all in cm.
Now let's check the contents of the loaded iris object.
iris
{'DESCR': 'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n :Number of Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4 numeric, predictive attributes and the class\n :Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n :Summary Statistics:\n\n ============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n ============== ==== ==== ======= ===== ====================\n sepal length: 4.3 7.9 5.84 0.83 0.7826\n sepal width: 2.0 4.4 3.05 0.43 -0.4194\n petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\n petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n ============== ==== ==== ======= ===== ====================\n\n :Missing Attribute Values: None\n :Class Distribution: 33.3% for each of 3 classes.\n :Creator: R.A. Fisher\n :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThis is a copy of UCI ML iris datasets.\nhttp://archive.ics.uci.edu/ml/datasets/Iris\n\nThe famous Iris database, first used by Sir R.A Fisher\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\nReferences\n----------\n - Fisher,R.A. "The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n - Dasarathy, B.V. 
(1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n - Many, many more ...\n', 'data': array([[ 5.1, 3.5, 1.4, 0.2], [ 4.9, 3. , 1.4, 0.2], [ 4.7, 3.2, 1.3, 0.2], [ 4.6, 3.1, 1.5, 0.2], [ 5. , 3.6, 1.4, 0.2], [ 5.4, 3.9, 1.7, 0.4], [ 4.6, 3.4, 1.4, 0.3], [ 5. , 3.4, 1.5, 0.2], [ 4.4, 2.9, 1.4, 0.2], [ 4.9, 3.1, 1.5, 0.1], [ 5.4, 3.7, 1.5, 0.2], [ 4.8, 3.4, 1.6, 0.2], [ 4.8, 3. , 1.4, 0.1], [ 4.3, 3. , 1.1, 0.1], [ 5.8, 4. , 1.2, 0.2], [ 5.7, 4.4, 1.5, 0.4], [ 5.4, 3.9, 1.3, 0.4], [ 5.1, 3.5, 1.4, 0.3], [ 5.7, 3.8, 1.7, 0.3], [ 5.1, 3.8, 1.5, 0.3], [ 5.4, 3.4, 1.7, 0.2], [ 5.1, 3.7, 1.5, 0.4], [ 4.6, 3.6, 1. , 0.2], [ 5.1, 3.3, 1.7, 0.5], [ 4.8, 3.4, 1.9, 0.2], [ 5. , 3. , 1.6, 0.2], [ 5. , 3.4, 1.6, 0.4], [ 5.2, 3.5, 1.5, 0.2], [ 5.2, 3.4, 1.4, 0.2], [ 4.7, 3.2, 1.6, 0.2], [ 4.8, 3.1, 1.6, 0.2], [ 5.4, 3.4, 1.5, 0.4], [ 5.2, 4.1, 1.5, 0.1], [ 5.5, 4.2, 1.4, 0.2], [ 4.9, 3.1, 1.5, 0.1], [ 5. , 3.2, 1.2, 0.2], [ 5.5, 3.5, 1.3, 0.2], [ 4.9, 3.1, 1.5, 0.1], [ 4.4, 3. , 1.3, 0.2], [ 5.1, 3.4, 1.5, 0.2], [ 5. , 3.5, 1.3, 0.3], [ 4.5, 2.3, 1.3, 0.3], [ 4.4, 3.2, 1.3, 0.2], [ 5. , 3.5, 1.6, 0.6], [ 5.1, 3.8, 1.9, 0.4], [ 4.8, 3. , 1.4, 0.3], [ 5.1, 3.8, 1.6, 0.2], [ 4.6, 3.2, 1.4, 0.2], [ 5.3, 3.7, 1.5, 0.2], [ 5. , 3.3, 1.4, 0.2], [ 7. , 3.2, 4.7, 1.4], [ 6.4, 3.2, 4.5, 1.5], [ 6.9, 3.1, 4.9, 1.5], [ 5.5, 2.3, 4. , 1.3], [ 6.5, 2.8, 4.6, 1.5], [ 5.7, 2.8, 4.5, 1.3], [ 6.3, 3.3, 4.7, 1.6], [ 4.9, 2.4, 3.3, 1. ], [ 6.6, 2.9, 4.6, 1.3], [ 5.2, 2.7, 3.9, 1.4], [ 5. , 2. , 3.5, 1. ], [ 5.9, 3. , 4.2, 1.5], [ 6. , 2.2, 4. , 1. 
], [ 6.1, 2.9, 4.7, 1.4], [ 5.6, 2.9, 3.6, 1.3], [ 6.7, 3.1, 4.4, 1.4], [ 5.6, 3. , 4.5, 1.5], [ 5.8, 2.7, 4.1, 1. ], [ 6.2, 2.2, 4.5, 1.5], [ 5.6, 2.5, 3.9, 1.1], [ 5.9, 3.2, 4.8, 1.8], [ 6.1, 2.8, 4. , 1.3], [ 6.3, 2.5, 4.9, 1.5], [ 6.1, 2.8, 4.7, 1.2], [ 6.4, 2.9, 4.3, 1.3], [ 6.6, 3. , 4.4, 1.4], [ 6.8, 2.8, 4.8, 1.4], [ 6.7, 3. , 5. , 1.7], [ 6. , 2.9, 4.5, 1.5], [ 5.7, 2.6, 3.5, 1. ], [ 5.5, 2.4, 3.8, 1.1], [ 5.5, 2.4, 3.7, 1. ], [ 5.8, 2.7, 3.9, 1.2], [ 6. , 2.7, 5.1, 1.6], [ 5.4, 3. , 4.5, 1.5], [ 6. , 3.4, 4.5, 1.6], [ 6.7, 3.1, 4.7, 1.5], [ 6.3, 2.3, 4.4, 1.3], [ 5.6, 3. , 4.1, 1.3], [ 5.5, 2.5, 4. , 1.3], [ 5.5, 2.6, 4.4, 1.2], [ 6.1, 3. , 4.6, 1.4], [ 5.8, 2.6, 4. , 1.2], [ 5. , 2.3, 3.3, 1. ], [ 5.6, 2.7, 4.2, 1.3], [ 5.7, 3. , 4.2, 1.2], [ 5.7, 2.9, 4.2, 1.3], [ 6.2, 2.9, 4.3, 1.3], [ 5.1, 2.5, 3. , 1.1], [ 5.7, 2.8, 4.1, 1.3], [ 6.3, 3.3, 6. , 2.5], [ 5.8, 2.7, 5.1, 1.9], [ 7.1, 3. , 5.9, 2.1], [ 6.3, 2.9, 5.6, 1.8], [ 6.5, 3. , 5.8, 2.2], [ 7.6, 3. , 6.6, 2.1], [ 4.9, 2.5, 4.5, 1.7], [ 7.3, 2.9, 6.3, 1.8], [ 6.7, 2.5, 5.8, 1.8], [ 7.2, 3.6, 6.1, 2.5], [ 6.5, 3.2, 5.1, 2. ], [ 6.4, 2.7, 5.3, 1.9], [ 6.8, 3. , 5.5, 2.1], [ 5.7, 2.5, 5. , 2. ], [ 5.8, 2.8, 5.1, 2.4], [ 6.4, 3.2, 5.3, 2.3], [ 6.5, 3. , 5.5, 1.8], [ 7.7, 3.8, 6.7, 2.2], [ 7.7, 2.6, 6.9, 2.3], [ 6. , 2.2, 5. , 1.5], [ 6.9, 3.2, 5.7, 2.3], [ 5.6, 2.8, 4.9, 2. ], [ 7.7, 2.8, 6.7, 2. ], [ 6.3, 2.7, 4.9, 1.8], [ 6.7, 3.3, 5.7, 2.1], [ 7.2, 3.2, 6. , 1.8], [ 6.2, 2.8, 4.8, 1.8], [ 6.1, 3. , 4.9, 1.8], [ 6.4, 2.8, 5.6, 2.1], [ 7.2, 3. , 5.8, 1.6], [ 7.4, 2.8, 6.1, 1.9], [ 7.9, 3.8, 6.4, 2. ], [ 6.4, 2.8, 5.6, 2.2], [ 6.3, 2.8, 5.1, 1.5], [ 6.1, 2.6, 5.6, 1.4], [ 7.7, 3. , 6.1, 2.3], [ 6.3, 3.4, 5.6, 2.4], [ 6.4, 3.1, 5.5, 1.8], [ 6. , 3. , 4.8, 1.8], [ 6.9, 3.1, 5.4, 2.1], [ 6.7, 3.1, 5.6, 2.4], [ 6.9, 3.1, 5.1, 2.3], [ 5.8, 2.7, 5.1, 1.9], [ 6.8, 3.2, 5.9, 2.3], [ 6.7, 3.3, 5.7, 2.5], [ 6.7, 3. , 5.2, 2.3], [ 6.3, 2.5, 5. , 1.9], [ 6.5, 3. , 5.2, 2. ], [ 6.2, 3.4, 5.4, 2.3], [ 5.9, 3. 
, 5.1, 1.8]]), 'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10')}
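At first glance, this output is hard to make sense of.
Look more closely, though, and some familiar values start to appear.
The measurements are stored as an array under 'data', and 'target_names' holds the name of each iris species.
In fact, the scikit-learn sample datasets are prepared in a form that is easy to analyze from the start, so that beginners can practice regression, clustering, and similar tasks.
Having complete, ready-made example data like this is a real help.
I will not explain every field here.
Please explore the data yourself and check how it is organized.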
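As a starting point for that exploration, here is a minimal sketch. It assumes the `iris` object was created with `sklearn.datasets.load_iris()` (the loading call is not shown in this part of the article, so treat that as an assumption) and simply lists the fields and their sizes instead of printing everything at once.

```python
from sklearn.datasets import load_iris

# Assumption: this is how `iris` was loaded earlier in the article.
iris = load_iris()

# The returned object behaves like a dictionary, so list its fields first.
print(sorted(iris.keys()))

# Size of the feature matrix (samples x features) and of the label vector.
print(iris.data.shape)
print(iris.target.shape)

# Names of the four features and of the three species.
print(iris.feature_names)
print(iris.target_names)
```

Looking at one field at a time is usually easier to read than the full dump shown above.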
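As noted above, it is great that complete example data is provided out of the box.
However, in the raw form shown above, it is hard to see how to analyze or inspect it.
In section 2, we learned how to read data with pandas.
So if we convert the data loaded by scikit-learn into a pandas or NumPy structure, we can analyze it.
If you plan to do the analysis with scikit-learn itself, it is a good idea to convert the data to the types that the library recommends.
Here, we do the conversion with pandas.
Do try NumPy as well.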
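For readers who want to try the NumPy route, the following is one possible sketch, not the only way to do it. `iris.data` is already a NumPy array, so no conversion is needed. The variable names `X`, `col_mean`, and `col_std` are introduced here for illustration, and `load_iris()` is assumed to be how the data was loaded.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: same loading call as the rest of the article.
iris = load_iris()
X = iris.data  # already a NumPy array of shape (150, 4)

# Column-wise mean and sample standard deviation (ddof=1 matches pandas' default).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=1)

for name, m, s in zip(iris.feature_names, col_mean, col_std):
    print(f"{name}: mean={m:.2f}, std={s:.2f}")

# Per-species mean, using a boolean mask built from the integer labels.
for k, species in enumerate(iris.target_names):
    print(species, X[iris.target == k].mean(axis=0))
```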
Now, let's use the pandas library to convert the data into a DataFrame.
# Create the DataFrame (set the column names to the feature names)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
5 | 5.4 | 3.9 | 1.7 | 0.4 |
6 | 4.6 | 3.4 | 1.4 | 0.3 |
7 | 5.0 | 3.4 | 1.5 | 0.2 |
8 | 4.4 | 2.9 | 1.4 | 0.2 |
9 | 4.9 | 3.1 | 1.5 | 0.1 |
10 | 5.4 | 3.7 | 1.5 | 0.2 |
11 | 4.8 | 3.4 | 1.6 | 0.2 |
12 | 4.8 | 3.0 | 1.4 | 0.1 |
13 | 4.3 | 3.0 | 1.1 | 0.1 |
14 | 5.8 | 4.0 | 1.2 | 0.2 |
15 | 5.7 | 4.4 | 1.5 | 0.4 |
16 | 5.4 | 3.9 | 1.3 | 0.4 |
17 | 5.1 | 3.5 | 1.4 | 0.3 |
18 | 5.7 | 3.8 | 1.7 | 0.3 |
19 | 5.1 | 3.8 | 1.5 | 0.3 |
20 | 5.4 | 3.4 | 1.7 | 0.2 |
21 | 5.1 | 3.7 | 1.5 | 0.4 |
22 | 4.6 | 3.6 | 1.0 | 0.2 |
23 | 5.1 | 3.3 | 1.7 | 0.5 |
24 | 4.8 | 3.4 | 1.9 | 0.2 |
25 | 5.0 | 3.0 | 1.6 | 0.2 |
26 | 5.0 | 3.4 | 1.6 | 0.4 |
27 | 5.2 | 3.5 | 1.5 | 0.2 |
28 | 5.2 | 3.4 | 1.4 | 0.2 |
29 | 4.7 | 3.2 | 1.6 | 0.2 |
... | ... | ... | ... | ... |
120 | 6.9 | 3.2 | 5.7 | 2.3 |
121 | 5.6 | 2.8 | 4.9 | 2.0 |
122 | 7.7 | 2.8 | 6.7 | 2.0 |
123 | 6.3 | 2.7 | 4.9 | 1.8 |
124 | 6.7 | 3.3 | 5.7 | 2.1 |
125 | 7.2 | 3.2 | 6.0 | 1.8 |
126 | 6.2 | 2.8 | 4.8 | 1.8 |
127 | 6.1 | 3.0 | 4.9 | 1.8 |
128 | 6.4 | 2.8 | 5.6 | 2.1 |
129 | 7.2 | 3.0 | 5.8 | 1.6 |
130 | 7.4 | 2.8 | 6.1 | 1.9 |
131 | 7.9 | 3.8 | 6.4 | 2.0 |
132 | 6.4 | 2.8 | 5.6 | 2.2 |
133 | 6.3 | 2.8 | 5.1 | 1.5 |
134 | 6.1 | 2.6 | 5.6 | 1.4 |
135 | 7.7 | 3.0 | 6.1 | 2.3 |
136 | 6.3 | 3.4 | 5.6 | 2.4 |
137 | 6.4 | 3.1 | 5.5 | 1.8 |
138 | 6.0 | 3.0 | 4.8 | 1.8 |
139 | 6.9 | 3.1 | 5.4 | 2.1 |
140 | 6.7 | 3.1 | 5.6 | 2.4 |
141 | 6.9 | 3.1 | 5.1 | 2.3 |
142 | 5.8 | 2.7 | 5.1 | 1.9 |
143 | 6.8 | 3.2 | 5.9 | 2.3 |
144 | 6.7 | 3.3 | 5.7 | 2.5 |
145 | 6.7 | 3.0 | 5.2 | 2.3 |
146 | 6.3 | 2.5 | 5.0 | 1.9 |
147 | 6.5 | 3.0 | 5.2 | 2.0 |
148 | 6.2 | 3.4 | 5.4 | 2.3 |
149 | 5.9 | 3.0 | 5.1 | 1.8 |
150 rows × 4 columns
With this, the pandas summary statistics shown earlier can now be computed.
As an aside, notice that the table above has no species column.
That information is stored separately, as shown below.
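A minimal sketch of such a computation, assuming the data was loaded with `load_iris()` and that `describe()` is the kind of summary meant here (the variable name `summary` is introduced for illustration):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: same loading call and DataFrame construction as above.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# count, mean, std, min, quartiles, and max for each feature.
summary = df.describe()
print(summary)

# Pairwise correlation between the four features.
print(df.corr())
```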
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
These are integers, but the dataset also has target_names, so let's look at that.
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
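In other words, each integer in iris.target is an index into this array, counted from the left (0 = setosa, 1 = versicolor, 2 = virginica), and so identifies the iris species.
As you can see, the data comes labeled from the start.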
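The index relationship can be used to put the species names back into the DataFrame. The sketch below is one way to do it, under the assumption that the data was loaded with `load_iris()`. The column name `species` is chosen here for illustration and is not part of the scikit-learn dataset.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: same loading call and DataFrame construction as above.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Index target_names with the integer labels to get one name per row.
df["species"] = iris.target_names[iris.target]

# Each species should appear the same number of times.
print(df["species"].value_counts())

# Mean of every feature, grouped by species.
print(df.groupby("species").mean())
```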
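So when you use the data for analysis, the short code below separates it into the target variable and the feature variables, and the analysis can proceed smoothly.
# Sepal length and width, petal length and width; all in cm.
iris_X = iris.data[:, :2] # use only the first two of the four features
iris_y = iris.target # ground-truth labels: three classes, 0, 1, and 2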
In this way, Python lets you try machine learning simply by calling library routines, without getting caught up in difficult equations.
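To make that concrete, here is a small sketch that feeds `iris_X` and `iris_y` into a classifier. The choice of k-nearest neighbors, the 70/30 split, and `random_state=0` are assumptions made for illustration; the article itself does not prescribe a model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: same loading call as the rest of the article.
iris = load_iris()
iris_X = iris.data[:, :2]  # first two features (sepal length and width)
iris_y = iris.target       # integer labels 0, 1, 2

# Hold out 30% of the samples for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    iris_X, iris_y, test_size=0.3, random_state=0, stratify=iris_y
)

# Fit a k-nearest-neighbors classifier (illustrative choice of model).
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Fraction of test samples classified correctly.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Because only the two sepal features are used, versicolor and virginica overlap considerably, so the score will typically be lower than with all four features.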