Sampling Methods

許sir

[Data Setup]

In [1]:
data(iris)
head(iris)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2   setosa
         4.9          3.0           1.4          0.2   setosa
         4.7          3.2           1.3          0.2   setosa
         4.6          3.1           1.5          0.2   setosa
         5.0          3.6           1.4          0.2   setosa
         5.4          3.9           1.7          0.4   setosa
In [10]:
dim(iris)
summary(iris)
str(iris)
  1. 150
  2. 5
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

[PART 1]. Simple Random Sampling

In [3]:
iris[sample(nrow(iris), 5), ] # randomly draw 5 rows
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
11            5.4          3.7           1.5          0.2      setosa
113           6.8          3.0           5.5          2.1   virginica
112           6.4          2.7           5.3          1.9   virginica
122           5.6          2.8           4.9          2.0   virginica
99            5.1          2.5           3.0          1.1  versicolor
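Because sample() draws different rows on every run, the listing above will not reproduce exactly. The following is a minimal sketch, assuming a fixed seed is acceptable; the seed value and the names srs_idx, srs_rows, boot_idx and boot_rows are illustrative and do not come from the original notebook.

```r
set.seed(123)  # arbitrary seed, chosen only so the draw can be repeated

# Simple random sampling without replacement: 5 distinct row indices
srs_idx <- sample(nrow(iris), 5)
srs_rows <- iris[srs_idx, ]

# Same sample size, but with replacement: a row may be drawn more than once
boot_idx <- sample(nrow(iris), 5, replace = TRUE)
boot_rows <- iris[boot_idx, ]

nrow(srs_rows)          # 5
anyDuplicated(srs_idx)  # 0, because no index repeats without replacement
```

Sampling without replacement is the default for sample(), which is why the original one-liner never returns the same row twice.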

[PART 2]. Stratified Random Sampling

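Stratified random sampling divides the population, according to some criterion, into several non-overlapping subpopulations called "strata". The strata are chosen so that variability between strata is large while variability within each stratum is small. Once the strata are defined, simple random sampling is used inside each stratum to draw the required proportion of units, and the per-stratum samples are combined into the final sample. Stratified sampling keeps the distribution of the sample consistent with that of the population, and it also reduces class-imbalance problems during analysis.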
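The procedure described above can be sketched in base R before turning to a package. This is a minimal illustration under stated assumptions: it uses the built-in mtcars data with cyl as the stratification variable (because its strata have unequal sizes), and the sampling fraction frac and all variable names are hypothetical.

```r
set.seed(1)   # arbitrary seed for repeatability
frac <- 0.5   # hypothetical sampling fraction applied to every stratum

# Row indices of mtcars grouped by stratum (number of cylinders)
strata_idx <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# Simple random sampling without replacement inside each stratum
picked <- unlist(lapply(strata_idx, function(idx) {
  n_h <- round(frac * length(idx))
  idx[sample.int(length(idx), n_h)]
}), use.names = FALSE)

# Combine the per-stratum draws into the final sample
strat_sample <- mtcars[picked, ]

# Compare stratum shares in the population and in the sample
round(prop.table(table(mtcars$cyl)), 2)
round(prop.table(table(strat_sample$cyl)), 2)
```

Using the same fraction in every stratum (proportional allocation) is what keeps the two sets of shares close to each other.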

In [4]:
# Implemented with the strata() function from the sampling package
install.packages("sampling")
library(sampling)
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
In [16]:
nrow(iris) # number of rows
150
In [20]:
n=round(3/5*nrow(iris)/3)  # draw 3/5 of each "Species", i.e. 30 rows per species
# so the target is 90 rows in total

Stratified sampling can be carried out with the strata() function from the sampling package:

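  • stratanames is the name of the variable (or variables) used to define the strata;
  • size sets the number of units to draw from each stratum. Its values must be given in the same order in which the levels of the stratification variable appear in the data, and the data set must be sorted in ascending order by that variable;
  • method selects one of 4 sampling methods: simple random sampling without replacement (srswor), simple random sampling with replacement (srswr), Poisson sampling (poisson), and systematic sampling (systematic). The default is srswor;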
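The ordering requirement on size is easiest to see with unequal stratum sizes. The sketch below is hedged: the sizes 10/20/30 are arbitrary illustration values, the names iris_sorted, sizes and st_demo are hypothetical, and the strata() call is skipped when the sampling package is not installed.

```r
# Sort by the stratification variable first, as strata() expects
iris_sorted <- iris[order(iris$Species), ]

# One sample size per stratum, in the order the strata appear in the sorted data
# (10/20/30 are arbitrary illustration values)
sizes <- c(setosa = 10, versicolor = 20, virginica = 30)
stopifnot(identical(names(sizes), as.character(unique(iris_sorted$Species))))

if (requireNamespace("sampling", quietly = TRUE)) {
  st_demo <- sampling::strata(iris_sorted, stratanames = "Species",
                              size = unname(sizes), method = "srswor")
  print(table(st_demo$Stratum))  # number of selected units per stratum
}
```

Naming the entries of sizes and checking them against the sorted data is a cheap guard against assigning a sample size to the wrong stratum.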
In [26]:
sub_train=strata(iris,stratanames=("Species"),size=rep(n,3),method="srswor", description = TRUE)
# size=rep(n,3): one sample size per stratum, i.e. n rows from each of the 3 Species levels
# description = TRUE prints the number of strata, plus each stratum's population total and number of selected units
Stratum 1 

Population total and number of selected units: 50 30 
Stratum 2 

Population total and number of selected units: 50 30 
Stratum 3 

Population total and number of selected units: 50 30 
Number of strata  3 
Total number of selected units 90 
In [29]:
nrow(sub_train)
getdata(iris, sub_train) # view all rows of the stratified sample
90
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species  ID_unit  Prob  Stratum
1             5.1          3.5           1.4          0.2      setosa        1   0.6        1
2             4.9          3.0           1.4          0.2      setosa        2   0.6        1
4             4.6          3.1           1.5          0.2      setosa        4   0.6        1
5             5.0          3.6           1.4          0.2      setosa        5   0.6        1
7             4.6          3.4           1.4          0.3      setosa        7   0.6        1
8             5.0          3.4           1.5          0.2      setosa        8   0.6        1
9             4.4          2.9           1.4          0.2      setosa        9   0.6        1
10            4.9          3.1           1.5          0.1      setosa       10   0.6        1
12            4.8          3.4           1.6          0.2      setosa       12   0.6        1
13            4.8          3.0           1.4          0.1      setosa       13   0.6        1
17            5.4          3.9           1.3          0.4      setosa       17   0.6        1
19            5.7          3.8           1.7          0.3      setosa       19   0.6        1
20            5.1          3.8           1.5          0.3      setosa       20   0.6        1
22            5.1          3.7           1.5          0.4      setosa       22   0.6        1
24            5.1          3.3           1.7          0.5      setosa       24   0.6        1
26            5.0          3.0           1.6          0.2      setosa       26   0.6        1
27            5.0          3.4           1.6          0.4      setosa       27   0.6        1
28            5.2          3.5           1.5          0.2      setosa       28   0.6        1
30            4.7          3.2           1.6          0.2      setosa       30   0.6        1
31            4.8          3.1           1.6          0.2      setosa       31   0.6        1
34            5.5          4.2           1.4          0.2      setosa       34   0.6        1
35            4.9          3.1           1.5          0.2      setosa       35   0.6        1
36            5.0          3.2           1.2          0.2      setosa       36   0.6        1
38            4.9          3.6           1.4          0.1      setosa       38   0.6        1
42            4.5          2.3           1.3          0.3      setosa       42   0.6        1
43            4.4          3.2           1.3          0.2      setosa       43   0.6        1
44            5.0          3.5           1.6          0.6      setosa       44   0.6        1
47            5.1          3.8           1.6          0.2      setosa       47   0.6        1
48            4.6          3.2           1.4          0.2      setosa       48   0.6        1
50            5.0          3.3           1.4          0.2      setosa       50   0.6        1
...
101           6.3          3.3           6.0          2.5   virginica      101   0.6        3
103           7.1          3.0           5.9          2.1   virginica      103   0.6        3
104           6.3          2.9           5.6          1.8   virginica      104   0.6        3
105           6.5          3.0           5.8          2.2   virginica      105   0.6        3
107           4.9          2.5           4.5          1.7   virginica      107   0.6        3
110           7.2          3.6           6.1          2.5   virginica      110   0.6        3
112           6.4          2.7           5.3          1.9   virginica      112   0.6        3
113           6.8          3.0           5.5          2.1   virginica      113   0.6        3
115           5.8          2.8           5.1          2.4   virginica      115   0.6        3
116           6.4          3.2           5.3          2.3   virginica      116   0.6        3
119           7.7          2.6           6.9          2.3   virginica      119   0.6        3
122           5.6          2.8           4.9          2.0   virginica      122   0.6        3
125           6.7          3.3           5.7          2.1   virginica      125   0.6        3
127           6.2          2.8           4.8          1.8   virginica      127   0.6        3
128           6.1          3.0           4.9          1.8   virginica      128   0.6        3
130           7.2          3.0           5.8          1.6   virginica      130   0.6        3
132           7.9          3.8           6.4          2.0   virginica      132   0.6        3
133           6.4          2.8           5.6          2.2   virginica      133   0.6        3
134           6.3          2.8           5.1          1.5   virginica      134   0.6        3
136           7.7          3.0           6.1          2.3   virginica      136   0.6        3
138           6.4          3.1           5.5          1.8   virginica      138   0.6        3
139           6.0          3.0           4.8          1.8   virginica      139   0.6        3
140           6.9          3.1           5.4          2.1   virginica      140   0.6        3
143           5.8          2.7           5.1          1.9   virginica      143   0.6        3
144           6.8          3.2           5.9          2.3   virginica      144   0.6        3
145           6.7          3.3           5.7          2.5   virginica      145   0.6        3
146           6.7          3.0           5.2          2.3   virginica      146   0.6        3
148           6.5          3.0           5.2          2.0   virginica      148   0.6        3
149           6.2          3.4           5.4          2.3   virginica      149   0.6        3
150           5.9          3.0           5.1          1.8   virginica      150   0.6        3
In [30]:
# split into a training set and a test set
data_train=iris[sub_train$ID_unit,]
data_test=iris[-sub_train$ID_unit,]

dim(data_train); dim(data_test)
  1. 90
  2. 5
  1. 60
  2. 5
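A 90/60 split only helps if both parts keep the class balance. The following base R sketch builds an equivalent stratified split without the sampling package and checks the class counts on each side; the seed and the name train_idx are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed for repeatability

# Draw 30 row indices from each Species level (each level has 50 rows)
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(idx) sample(idx, 30)),
                    use.names = FALSE)

data_train <- iris[train_idx, ]
data_test <- iris[-train_idx, ]

# Class counts in each part: 30 per species for training, 20 for testing
table(data_train$Species)
table(data_test$Species)
```

Negative indexing with the same index vector guarantees that no row appears in both sets.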

[PART 3]. Cluster Sampling

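In cluster sampling, the population is divided into several clusters (for example tribes or regions), and a number of whole clusters is then selected; the selected clusters are either sampled further or enumerated completely. The Chinese literature also uses the alternative terms 部落抽樣 and 叢聚抽樣 for the same method. When cluster sampling is considered, each cluster is generally expected to represent the whole data set well, that is, variability between clusters should be small and variability within clusters large. Consequently, when the clusters differ substantially from one another, cluster sampling tends to produce samples with narrow coverage or poor representativeness.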

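Cluster sampling is performed with the cluster() function from the sampling package. Apart from clustername and size, which differ slightly, its arguments have the same meaning as those of strata().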

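  • clustername, as the name suggests, is the name of the variable that defines the clusters.
  • size is a positive integer giving the number of clusters to select.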
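The cells that follow reuse strata() on the Swiss data, so a direct cluster() call may be a useful complement. This is a hedged sketch rather than the notebook's own code: it treats REG as the cluster variable, selects 3 clusters as an arbitrary choice, uses an arbitrary seed, and is skipped when the sampling package is not installed. The names cl and cl_sample are hypothetical.

```r
if (requireNamespace("sampling", quietly = TRUE)) {
  library(sampling)
  data(swissmunicipalities)
  set.seed(2024)  # arbitrary seed for repeatability

  # Select 3 of the 7 regions; every municipality in a selected region is kept
  cl <- cluster(swissmunicipalities, clustername = "REG", size = 3,
                method = "srswor")
  cl_sample <- getdata(swissmunicipalities, cl)

  # Regions that ended up in the sample and their municipality counts
  print(table(swissmunicipalities$REG[cl$ID_unit]))
}
```

Unlike strata(), where size is a vector with one entry per stratum, size here is a single number because whole clusters are the sampling units.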
In [37]:
data(swissmunicipalities)
xdata=swissmunicipalities

A data frame with 2896 observations on the following 22 variables:

  • CT: Swiss canton.
  • REG: Swiss region.
  • COM: municipality number.
  • Nom: municipality name.
  • HApoly: municipality area.
  • Surfacesbois: wood area.
  • Surfacescult: area under cultivation.
  • Alp: mountain pasture area.
  • Airbat: area with buildings.
  • Airind: industrial area.
  • P00BMTOT: number of men.
  • P00BWTOT: number of women.
  • Pop020: number of men and women aged between 0 and 19.
  • Pop2040: number of men and women aged between 20 and 39.
  • Pop4065: number of men and women aged between 40 and 64.
  • Pop65P: number of men and women aged 65 and over.
  • H00PTOT: number of households.
  • H00P01: number of households with 1 person.
  • H00P02: number of households with 2 persons.
  • H00P03: number of households with 3 persons.
  • H00P04: number of households with 4 persons.
  • POPTOT: total population.
In [40]:
head(xdata)
#table(xdata$REG)
CT  REG   COM  Nom         HApoly  Surfacesbois  Surfacescult  Alp  Airbat  Airind  Pop020  Pop2040  Pop4065  Pop65P  H00PTOT  H00P01  H00P02  H00P03  H00P04  POPTOT
 1    4   261  Zurich        8781          2326           967    0    2884     260   57324   131422   108178   66349   186880   94797   55019   17596   19468  363273
25    1  6621  Geneve        1593            67            31    0     773      60   32429    60074    57063   28398    86231   44373   22145    9761    9952  177964
12    3  2701  Basel         2391            97            93    0    1023     213   28161    50349    53734   34314    86371   44469   24838    7890    9174  166558
 2    2   351  Bern          5162          1726          1041    0    1070     212   19399    44263    39397   25575    67115   34981   20222    5859    6053  128634
22    1  5586  Lausanne      4136          1635           714    0     856      64   24291    44202    35421   21000    62258   31205   17122    6515    7416  124914
 1    4   230  Winterthur    6787          2807          1827    0     972     238   18942    28958    27696   14887    41362   16346   13454    4804    6758   90483
In [45]:
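data=xdata[order(xdata$REG),] # sorted copy; note that it is not used below

# strata() is applied to the unsorted xdata, so each entry of size is matched to a stratum
# in order of first appearance rather than ascending REG (in the output, Stratum 1 is REG 4)
st=strata(xdata,stratanames=c("REG"),size=c(30,20,45,15,20,11,44), method="srswor", description = TRUE)

sample = getdata(xdata, st)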
Stratum 1 

Population total and number of selected units: 171 30 
Stratum 2 

Population total and number of selected units: 589 20 
Stratum 3 

Population total and number of selected units: 321 45 
Stratum 4 

Population total and number of selected units: 913 15 
Stratum 5 

Population total and number of selected units: 471 20 
Stratum 6 

Population total and number of selected units: 186 11 
Stratum 7 

Population total and number of selected units: 245 44 
Number of strata  7 
Total number of selected units 185 
In [46]:
getdata(xdata, st) # view all rows of the stratified sample
#sample
      CT   COM  Nom                  HApoly  Surfacesbois  Surfacescult  Alp  Airbat  Airind  P00BMTOT  H00PTOT  H00P01  H00P02  H00P03  H00P04  POPTOT  REG  ID_unit       Prob  Stratum
29     1   243  Dietikon                938           254           160    0     190      78     10630     9707    3702    3189    1199    1617   21353    4       29  0.1754386        1
73     1    53  Bulach                 1612           634           535    0     200      50      6842     5985    2006    1959     821    1199   13999    4       73  0.1754386        1
77     1   247  Schlieren               659           184           128    0     123      73      6719     6159    2546    1916     763     934   13356    4       77  0.1754386        1
95     1   158  Stafa                   858           163           395    0     194      12      5593     5071    1761    1796     610     904   11567    4       95  0.1754386        1
126    1   177  Pfaffikon              1956           462           843    0     172      21      4729     3890    1249    1276     518     847    9592    4      126  0.1754386        1
145    1   115  Gossau (ZH)            1827           261          1225    0     175      11      4329     3392     841    1224     467     860    8685    4      145  0.1754386        1
156    1   155  Mannedorf               477           118           163    0     115       6      3948     3715    1309    1366     446     594    8348    4      156  0.1754386        1
286    1   159  Uetikon am See          345            59           165    0      83       8      2438     2071     628     736     281     426    5210    4      286  0.1754386        1
353    1   157  Oetwil am See           612            75           415    0      59      11      2169     1736     575     499     240     422    4375    4      353  0.1754386        1
366    1   171  Bauma                  2074          1126           733   12      86      18      2133     1579     443     474     218     444    4259    4      366  0.1754386        1
378    1   111  Baretswil              2224           870          1142    0      90       5      2089     1605     408     595     223     379    4172    4      378  0.1754386        1
428    1   251  Weiningen (ZH)          537           206           176    0      59       8      1948     1597     551     498     218     330    3791    4      428  0.1754386        1
438    1     9  Mettmenstetten         1302           249           890    0      85       4      1875     1415     373     458     205     379    3724    4      438  0.1754386        1
524    1   116  Gruningen               877           180           547    0      80       5      1514     1105     321     352     141     291    3092    4      524  0.1754386        1
634    1    13  Stallikon              1201           617           474    0      61       4      1349     1078     275     433     156     214    2608    4      634  0.1754386        1
759    1    57  Freienstein-Teufen      837           413           331    0      36       4      1066      799     189     285     109     216    2127    4      759  0.1754386        1
815    1   114  Fischenthal            3029          1911           853  102      48       6       978      726     201     221     109     195    1961    4      815  0.1754386        1
851    1    94  Otelfingen              716           267           341    0      32      24       927      741     182     250     127     182    1852    4      851  0.1754386        1
858    1    33  Kleinandelfingen       1035           347           528    0      54       3       892      690     172     232     105     181    1821    4      858  0.1754386        1
865    1    35  Marthalen              1412           557           696    0      49       9       882      684     197     207     108     172    1803    4      865  0.1754386        1
875    1   181  Wila                    916           470           353    0      44      11       914      715     221     218      93     183    1793    4      875  0.1754386        1
1127   1   119  Seegraben               377            58           187    0      32       3       676      481     132     154      68     127    1279    4     1127  0.1754386        1
1233   1    99  Schofflisdorf           403           187           172    0      31       1       542      443     118     161      54     110    1133    4     1233  0.1754386        1
1334   1    65  Oberembrach            1025           346           597    0      32       0       492      398     109     146      50      93     990    4     1334  0.1754386        1
1338   1   212  Bertschikon             971           193           664    0      35       2       505      335      54     117      54     110     985    4     1338  0.1754386        1
1452   1     6  Kappel am Albis         792           167           553    0      26       0       445      329      87     100      54      88     865    4     1452  0.1754386        1
1460   1   134  Hutten                  728           265           384    0      19       0       442      324      82     117      33      92     860    4     1460  0.1754386        1
1726   1   211  Altikon                 772           158           527    0      26       0       310      228      50      80      31      67     613    4     1726  0.1754386        1
2065   1   175  Kyburg                  761           464           242    0      16       3       206      147      33      51      19      44     396    4     2065  0.1754386        1
2323   1    43  Volken                  318            91           210    0       8       0       143      100      25      36      12      27     268    4     2323  0.1754386        1
...
974   21  5225  Sorengo                  85            13            22    0      34       0       715      618     209     170     131     108    1557    7      974  0.1795918        7
976   21  5072  Faido                   372           236            48    0      39       4       725      614     214     170     106     124    1548    7      976  0.1795918        7
1000  21  5151  Bioggio                 305            80            98    0      35      25       689      635     187     202     114     132    1504    7     1000  0.1795918        7
1019  21  5285  Lodrino                3150          2150           201   97      47      11       752      549     117     162     121     149    1461    7     1019  0.1795918        7
1045  21  5253  Ligornetto              202            45            83    0      36       4       672      571     141     188     114     128    1408    7     1045  0.1795918        7
1071  21  5262  Rancate                 231            61            68    0      34      13       657      558     148     184     120     106    1353    7     1071  0.1795918        7
1192  21  5148  Bedano                  187           100            29    2      26      12       564      441     102     135      96     108    1196    7     1192  0.1795918        7
1269  21  5107  Gerra (Verzasca)       1868          1075            51   87      55       1       517      467     133     159      90      85    1098    7     1269  0.1795918        7
1310  21  5187  Gravesano                69            16            15    0      25       2       508      402      93     115     106      88    1022    7     1310  0.1795918        7
1426  21  5133  Verscio                 300           196            25   14      22       0       440      375     105     120      75      75     887    7     1426  0.1795918        7
1523  21  5202  Monteggio               336           164            91    0      45       3       388      352     118     117      60      57     784    7     1523  0.1795918        7
1539  21  5213  Ponte Tresa              41            18             1    0      13       0       353      373     143     129      58      43     769    7     1539  0.1795918        7
1640  21  5219  Rovio                   553           435            24   16      28       0       327      281      76      97      58      50     673    7     1640  0.1795918        7
1794  21  5195  Maroggia                100            58             4    0      17       2       288      268      99      89      53      27     562    7     1794  0.1795918        7
1816  21  5149  Bedigliora              248           191            23    0      21       2       261      233      81      65      42      45     540    7     1816  0.1795918        7
1915  21  5265  Salorino                498           433            35    6      16       0       242      198      55      61      39      43     487    7     1915  0.1795918        7
2073  21  5267  Tremona                 158           104            28    0      18       0       192      158      43      48      31      36     393    7     2073  0.1795918        7
2096  21  5069  Chiggiogna              392           208            36    0      10       7       185      166      61      47      23      35     378    7     2096  0.1795918        7
2154  21  5206  Neggio                   91            60            14    0       8       0       173      137      47      46      12      32     352    7     2154  0.1795918        7
2257  21  5303  Bignasco               8151          2367            32  550      10       0       168      114      39      24      19      32     306    7     2257  0.1795918        7
2260  21  5135  Vogorno                2388          1421            23  299      15       0       151      147      68      31      25      23     304    7     2260  0.1795918        7
2363  21  5159  Breno                   575           399            24   46      11       0       122      126      54      39      14      19     255    7     2363  0.1795918        7
2491  21  5313  Giumaglio              1316           680            20   60       5       0        96       81      20      27      14      20     202    7     2491  0.1795918        7
2540  21  5244  Bruzella                344           300            33    1       2       0        94       74      18      25      12      19     183    7     2540  0.1795918        7
2763  21  5183  Fescoggia               245           217            12    4       4       0        41       40      13      16       6       5      88    7     2763  0.1795918        7
2803  21  5092  Auressio                299           198             3   21       2       0        33       31       9      11       6       5      71    7     2803  0.1795918        7
2810  21  5032  Campo (Blenio)         2196           556            52  754       3       0        39       32      11      14       4       3      68    7     2810  0.1795918        7
2818  21  5132  Vergeletto             4078          1750            18  633       3       1        29       35      18       8       5       4      65    7     2818  0.1795918        7
2835  21  5307  Campo (Vallemaggia)    4327          1922            89  487      18       0        28       30      13      10       5       2      58    7     2835  0.1795918        7
2863  21  5067  Campello                396           114            35  121      12       0        25       22      10       4       6       2      45    7     2863  0.1795918        7
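The Prob column in the listing is the inclusion probability of each unit, which under simple random sampling without replacement is the stratum sample size divided by the stratum population total. The sketch below recomputes it from the figures printed by description = TRUE; the variable names and the design-weight step (1/Prob) are additions for illustration and are not part of the original notebook.

```r
# Population totals and selected units per stratum, copied from the strata() output
pop_total  <- c(171, 589, 321, 913, 471, 186, 245)
n_selected <- c(30, 20, 45, 15, 20, 11, 44)

# Inclusion probability per stratum under srswor: n_h / N_h
prob <- n_selected / pop_total
round(prob, 7)  # first value 0.1754386, last value 0.1795918, as in the Prob column

# Design weight: how many population units each sampled unit stands for
weight <- 1 / prob

sum(n_selected)           # 185 sampled units in total
sum(n_selected * weight)  # 2896, the number of municipalities in the data set
```

The last line is a quick sanity check: the weighted sample size should recover the population size.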