⚛️ Statistics and Machine Learning

Descriptive Statistics

Census

Inferential Statistics

Multivariate Analysis

Arithmetic mean

Geometric mean

Harmonic mean

For positive values: arithmetic mean $\geqq$ geometric mean $\geqq$ harmonic mean, with equality only when all values are equal
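The ordering of the three means can be checked with Python's standard `statistics` module. This is a minimal sketch; the sample `values` is a made-up assumption, not data from these notes.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical positive sample, chosen only for illustration.
values = [1, 2, 4]

am = mean(values)            # (1 + 2 + 4) / 3
gm = geometric_mean(values)  # cube root of 1 * 2 * 4
hm = harmonic_mean(values)   # 3 / (1/1 + 1/2 + 1/4)

print(f"arithmetic={am:.4f}  geometric={gm:.4f}  harmonic={hm:.4f}")
print("AM >= GM >= HM:", am >= gm >= hm)
```

The inequality only applies to positive values; `geometric_mean` and `harmonic_mean` reject negative inputs.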

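Sort the data in ascending order and take the cut points that split it into four equal parts

First quartile (Q1)
Median (second quartile, Q2)
Third quartile (Q3)

Interquartile range IQR = third quartile − first quartile

Mode: the value that occurs most often
Minimum, maximum

Range = maximum − minimum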
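The summary statistics above could be computed as follows. This is a sketch with made-up data. Quartile conventions differ between textbooks and libraries; the `"inclusive"` method used here is one of the two that `statistics.quantiles` offers, so other tools may return slightly different cut points.

```python
from statistics import median, multimode, quantiles

# Hypothetical data, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
data_range = max(data) - min(data)

# multimode returns every value tied for the highest count.
modes = multimode([1, 2, 2, 3, 3])

print(q1, q2, q3, iqr, data_range, modes)
```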

Principal Component Analysis (PCA) #dimensionality-reduction

Regression Analysis #prediction

Dispersion

Discrete random variables

Die roll outcomes
Boolean (true/false) values
Population counts

Probability Mass Function

Let $X$ be a discrete random variable on a sample space $S$. Then the probability mass function $f$ satisfies

$f(x) = \operatorname {Pr}(X=x)$
$f(x)≥0 \quad \text{for all } x∈S$
$\displaystyle \sum_{x \in S} f(x)=1$
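As a small illustration of these three properties, the sketch below builds the probability mass function of a fair six-sided die (the die is taken from the examples listed earlier; the variable names are assumptions) and checks it with exact fractions.

```python
from fractions import Fraction

# PMF of a fair die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

all_non_negative = all(p >= 0 for p in pmf.values())
total = sum(pmf.values())  # must equal 1
expected_value = sum(face * p for face, p in pmf.items())

print(all_non_negative, total, expected_value)
```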

Discrete Uniform Distribution

Bernoulli Distribution

Binomial Distribution

Poisson Distribution (counts the events of a process whose waiting times follow the exponential distribution)
Geometric Distribution
Hypergeometric Distribution

Continuous random variables

Height and weight

Probability Density Function

Let $X$ be a continuous random variable. A probability density function $f(x)$ is an integrable function where

$\int_a^b f(x)dx = \operatorname{Pr}(a<X≤b)$
$f(x)≥0 \quad \text{for all } x∈\Reals$
$\int_{-\infin}^\infin f(x)dx = 1$
Note: the value of $f(x)$ is a density, not a probability, and it can exceed 1
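To see why a density is not a probability, consider a continuous uniform distribution on $[0, 0.5]$, chosen here purely as an illustration. Its density is 2 inside the interval, yet every probability obtained by integrating it stays within $[0, 1]$. The midpoint-rule `integrate` helper is an assumption made for this sketch, not a library function.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Density of a continuous uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


density_at_point = uniform_pdf(0.25)            # a density value above 1
total_area = integrate(uniform_pdf, -1.0, 1.0)  # close to 1
prob_interval = integrate(uniform_pdf, 0.1, 0.3)  # Pr(0.1 < X <= 0.3)

print(density_at_point, round(total_area, 6), round(prob_interval, 6))
```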

Continuous Uniform Distribution

Gaussian (Normal) Distribution

Student’s t-Distribution

F-Distribution (Fisher–Snedecor Distribution)

Log-normal Distribution
Chi-squared Distribution

Gamma Distribution
Beta Distribution

Exponential Distribution (waiting times between the events of a Poisson process)

Central Limit Theorem

Draw a sample of size $n$ from a population with mean $\mu$ and finite variance $\sigma^2$;

as $n$ grows, the distribution of the sample mean $\overline x$ approaches a normal distribution with expected value $\mu$ and variance $\dfrac{\sigma^2}{n}$
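A quick simulation can make the theorem concrete. The sketch below assumes an exponential population with rate 1 (so $\mu = 1$ and $\sigma^2 = 1$), which is deliberately non-normal; the sample size, repeat count, and seed are arbitrary choices.

```python
import random
from statistics import fmean, pvariance

random.seed(0)   # fixed seed so the run is repeatable
n = 50           # size of each sample
repeats = 5_000  # number of samples drawn

# Each entry is the mean of one sample from Exponential(1).
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

mean_of_means = fmean(sample_means)         # should be near mu = 1
variance_of_means = pvariance(sample_means)  # should be near sigma^2 / n

print(f"mean of sample means:     {mean_of_means:.4f}")
print(f"variance of sample means: {variance_of_means:.4f} (theory: {1 / n:.4f})")
```

Plotting a histogram of `sample_means` would show the bell shape even though the population itself is strongly skewed.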

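Confidence Interval

An interval estimate for an unknown parameter, reported with a confidence level that describes how often the interval-building procedure captures the true value over repeated sampling

Note: a 95% confidence interval does not mean that one specific computed interval contains the true value with probability 95%; the 95% describes the procedure, not any single interval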
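The "procedure, not interval" reading can be illustrated by simulation. The sketch below assumes a normal population with known $\sigma$ and hypothetical values for `mu`, `sigma`, and `n`, builds a 95% z-interval for each of many samples, and counts how often the interval covers the true mean.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable
mu, sigma, n, repeats = 10.0, 2.0, 30, 2_000  # hypothetical settings

z = NormalDist().inv_cdf(0.975)  # about 1.96 for a two-sided 95% interval
half_width = z * sigma / n ** 0.5

hits = 0
for _ in range(repeats):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    center = fmean(sample)
    # Each individual interval either contains mu or it does not.
    if center - half_width <= mu <= center + half_width:
        hits += 1

coverage = hits / repeats  # long-run fraction of intervals that contain mu
print(f"coverage over {repeats} intervals: {coverage:.3f}")
```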

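Power of a Test

Bruce B. Frey, 《你需要的統計指南在這裡！》 (《There's a Stat for That!》): a cookbook-style guide that covers each test and statistical method under the same set of entries

How to design the study
The statistical question being addressed
The kind of research it suits
Example data
Considerations when using it
References to real papers

(The author's previous book, 《愛上統計學：使用SPSS》, lists many English-language resources and resources on the history of statistics at the end)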

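Six Principles on p-values

Long-standing misuse of p-values and statistical significance in academia has produced many erroneous or invalid papers, so the ASA (American Statistical Association) published the following principles in 2016

A p-value can indicate how incompatible the data are with a specified statistical model
A p-value does not measure the probability that the null hypothesis is true, nor the probability that the data were produced by random chance alone
Scientific conclusions and business decisions should not be based on a p-value alone
Proper inference requires full reporting and transparency
Neither a p-value nor statistical significance measures the size of an effect or the importance of a result
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

(《不敗的數據學》 by Alex Reinhart gives many operational details for statistical experiments that help avoid erroneous or spurious results)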

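Joint Probability, Marginal Probability, Conditional Probability

	$\operatorname{Pr}(A)$ cataract	$\operatorname{Pr}(\sim A)$ no cataract	Marginal probability (row total)
$\operatorname{Pr}(B)$ macular degeneration	0.2	0.05	0.25
$\operatorname{Pr}(\sim B)$ no macular degeneration	0.6	0.15	0.75
Marginal probability (column total)	0.8	0.2	1 (each margin sums to 1)

The four inner cells are joint probabilities (Joint Probability)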

$\operatorname{Pr}(A \cap B) \quad \quad = 0.2$
$\operatorname{Pr}(A \cap \sim B) \quad = 0.6$
$\operatorname{Pr}(\sim A \cap B) \quad = 0.05$
$\operatorname{Pr}(\sim A \cap \sim B) = 0.15$

The row and column totals are marginal probabilities (Marginal Probability)

$\operatorname{Pr}(A) \quad = 0.8 \quad$ $\quad = \operatorname{Pr}(A \cap B) + \operatorname{Pr}(A \cap \sim B)$
$\operatorname{Pr}(B) \quad = 0.25 \quad$ $\quad = \operatorname{Pr}(A \cap B) + \operatorname{Pr}(\sim A \cap B)$
$\operatorname{Pr}(\sim A) = 0.2 \quad \; \;$ $\quad = \operatorname{Pr}(\sim A \cap B) + \operatorname{Pr}(\sim A \cap \sim B)$
$\operatorname{Pr}(\sim B) = 0.75 \quad$ $\quad = \operatorname{Pr}(A \cap \sim B) + \operatorname{Pr}(\sim A \cap \sim B)$

Conditional Probability

Notation for conditional probability: $\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$}$ $\operatorname{Pr}(A \mid \sim B)$ $\colorbox{#D4A1FF}{$\operatorname{Pr}(B \mid A)$}$ $\operatorname{Pr}(B \mid \sim A)$

What conditional probability means: $\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$}$ is "the probability that $A$ occurs given that $B$ has occurred", $\operatorname{Pr}(A \mid \sim B)$ is "the probability that $A$ occurs given that $\sim B$ has occurred", and so on.

Key ideas about conditional probability (follow along with the table above and a Venn diagram of your own)

$\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$} \color{red}{\bold \; \not = \;} \color{black} \colorbox{#D4A1FF}{$\operatorname{Pr}(B \mid A)$}$
$\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$}=\dfrac{\colorbox {yellow}{$\operatorname{Pr}(A \cap B)$}} {\operatorname{Pr}(B)} \qquad \cdots \cdots \fcolorbox {black}{white}{Formula ①}$
$\colorbox{#D4A1FF}{$\operatorname{Pr}(B \mid A)$}=\dfrac{\colorbox {yellow}{$\operatorname{Pr}(B \cap A)$}} {\operatorname{Pr}(A)} \qquad \cdots \cdots \fcolorbox {black}{white}{Formula ②}$
If events $A$ and $B$ are independent, then $\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$} = \operatorname{Pr}(A \mid \sim B) = \operatorname{Pr}(A)$, so $\operatorname{Pr}(A) * \operatorname{Pr}(B) = \colorbox {yellow}{$\operatorname{Pr}(A \cap B)$}$

Bayes’ Theorem

From $\fcolorbox {black}{white}{Formula ①}$ and $\fcolorbox {black}{white}{Formula ②}$ of conditional probability, and because $\operatorname{Pr}(A \cap B) = \operatorname{Pr}(B \cap A)$:

$\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$} * \operatorname{Pr}(B) = \colorbox{#D4A1FF}{$\operatorname{Pr}(B \mid A)$} * \operatorname{Pr}(A)$

which gives Bayes’ theorem:

$\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$} = \dfrac {\colorbox{#D4A1FF}{$\operatorname{Pr}(B \mid A)$} * \operatorname{Pr}(A)} {\operatorname{Pr}(B)}$
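The cataract and macular degeneration table can be worked through in code. This sketch stores the four joint probabilities from the table as exact fractions (the dictionary layout and variable names are assumptions) and derives the marginal and conditional probabilities from them.

```python
from fractions import Fraction as F

# Joint probabilities from the table: A = cataract, B = macular degeneration.
joint = {
    ("A", "B"): F(20, 100),
    ("A", "~B"): F(60, 100),
    ("~A", "B"): F(5, 100),
    ("~A", "~B"): F(15, 100),
}

# Marginal probabilities: sum the joint probabilities over the other event.
pr_a = joint[("A", "B")] + joint[("A", "~B")]
pr_b = joint[("A", "B")] + joint[("~A", "B")]

# Conditional probabilities (the two numbered formulas).
pr_a_given_b = joint[("A", "B")] / pr_b
pr_b_given_a = joint[("A", "B")] / pr_a

# Bayes' theorem recovers Pr(A | B) from Pr(B | A).
pr_a_given_b_via_bayes = pr_b_given_a * pr_a / pr_b

# Independence check: does Pr(A) * Pr(B) equal Pr(A and B)?
independent = joint[("A", "B")] == pr_a * pr_b

print(pr_a, pr_b, pr_a_given_b, pr_b_given_a, independent)
```

In this particular table $A$ and $B$ happen to be independent, so $\operatorname{Pr}(A \mid B)$ equals $\operatorname{Pr}(A)$, while $\operatorname{Pr}(A \mid B)$ and $\operatorname{Pr}(B \mid A)$ still differ.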

What is "probability"?

Bayesian Inference

Bayesian inference is a "scientific method": it uses newly acquired data to adjust one's subjective degree of belief in a hypothesis

《寫給大家的統計學》, Will Kurt

Bayesian statistics is a reasoning tool for describing the worldview one believes in and for handling uncertainty

$\underset {\text{Bayes' theorem}} {\undergroup {\colorbox{#00EE00}{$\operatorname{Pr}(A \mid B)$} = {\dfrac{\colorbox{#D4A1FF}{$\operatorname{Pr}(B \mid A)$} * \operatorname{Pr}(A)}{\operatorname{Pr}(B) }}}}=\dfrac{\operatorname{Pr}(B \mid A) * \operatorname{Pr}(A)}{\operatorname{Pr}(A \cap B)+\operatorname{Pr}(\sim A \cap B)}=\underset {\text{used in Bayesian inference}} {\undergroup {\colorbox{#F7F4Fa}{$\dfrac{\operatorname{Pr}(B \mid A) * \operatorname{Pr}(A)}{\operatorname{Pr}(B \mid A) * \operatorname{Pr}(A)+\operatorname{Pr}(B \mid \sim A) * \operatorname{Pr}(\sim A)}$}}}$

Induction, Deduction

Deriving an opinion or conclusion through steps such as hypotheses, facts, experimental design, and data

$\color{gray}{\text{[Change of notation]}} \\ \space \\ \color{gray}{A \to H \text{ (hypothesis, } \fbox{cause} \text{)}}\\ \space \\ \color{gray}{B \to D \text{ (data, } \fbox{effect} \text{)}}$

$\colorbox{#00EE00} {$\operatorname{Pr}(A \mid B)$} = \underset {\text{used in Bayesian inference}} {\undergroup {\colorbox{#F7F4Fa}{$\dfrac{\operatorname{Pr}(B \mid A) * \operatorname{Pr}(A)}{\operatorname{Pr}(B \mid A) * \operatorname{Pr}(A)+\operatorname{Pr}(B \mid \sim A) * \operatorname{Pr}(\sim A)}$}}} \qquad \overset{\text{rewritten in the terms of Bayesian inference}}{\longrightarrow} \qquad \underset{\large \text{posterior}}{\underset {\large \uparrow}{\fcolorbox{red}{#00EE00} {$\operatorname{Pr}(H \mid D)$}}} = \dfrac { \overset{\large \text{likelihood}}{\overset{\large \searrow }{\fcolorbox{red}{white}{${\operatorname{Pr}(D \mid H)}$}}} * \overset{\large \text{prior}}{\overset{\large \swarrow }{\fcolorbox{red}{white}{$\operatorname{Pr}(H)$}}}}{\operatorname{Pr}(D \mid H) * \operatorname{Pr}(H)+\operatorname{Pr}(D \mid \sim H) * \operatorname{Pr}(\sim H)}$

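Likelihood

The probability of the observed data under a given hypothesis, $\operatorname{Pr}(D \mid H)$;

likelihood concerns events that have already happened and whose outcomes are known

Prior Probability

The probability that cause H holds before any data D is obtained;

a subjective degree of belief in how likely each hypothesis is to be true

Posterior Probability

The probability that cause H holds after the data D has been taken into account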
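The prior, likelihood, and posterior fit together as in the sketch below. The scenario (a screening test with a 1% prior, a 95% chance of a positive result when the hypothesis is true, and a 5% chance when it is false) is entirely hypothetical, and `posterior` is a helper name invented for this example.

```python
from fractions import Fraction as F


def posterior(prior, likelihood, likelihood_if_not):
    """Pr(H | D) from Pr(H), Pr(D | H), and Pr(D | ~H)."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


prior = F(1, 100)                # Pr(H): belief before seeing any data
pr_d_given_h = F(95, 100)        # Pr(D | H): likelihood
pr_d_given_not_h = F(5, 100)     # Pr(D | ~H)

after_one = posterior(prior, pr_d_given_h, pr_d_given_not_h)
# The posterior becomes the prior for the next, independent observation.
after_two = posterior(after_one, pr_d_given_h, pr_d_given_not_h)

print(after_one, float(after_one))
print(after_two, float(after_two))
```

Reusing the posterior as the next prior is what "adjusting belief with new data" means in practice, under the assumption that the two observations are conditionally independent.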

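Markov chain Monte Carlo (MCMC)

Monte Carlo is a way of generating data by random sampling from a known distribution
A Markov chain is a model that moves from state to state, where the next state depends only on the current one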
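One way to combine the two ideas is the Metropolis algorithm, a specific MCMC method that these notes do not name; it is used here only as an illustration. The sketch samples from a standard normal target given as an unnormalized log-density; the step size, chain length, burn-in, and seed are arbitrary assumptions.

```python
import math
import random
from statistics import fmean, pvariance


def metropolis(log_density, start, steps, step_size, rng):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    x = start
    samples = []
    for _ in range(steps):
        # Markov chain step: the proposal depends only on the current state.
        proposal = x + rng.uniform(-step_size, step_size)
        log_accept = log_density(proposal) - log_density(x)
        # Monte Carlo step: accept with probability min(1, density ratio).
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    """Log-density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x


rng = random.Random(42)
chain = metropolis(standard_normal_log_density, 0.0, 20_000, 2.0, rng)
kept = chain[1_000:]  # discard an initial burn-in segment

chain_mean = fmean(kept)
chain_variance = pvariance(kept)
print(f"mean={chain_mean:.3f} variance={chain_variance:.3f}")
```

The sample mean and variance should land near 0 and 1, although successive draws are correlated, so the estimates are noisier than independent samples of the same size would give.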

Confusion Matrix

\begin{array}{|c|c|c|c|c|c|}\hline & & \text{Actual condition} & \text{Actual condition} \\
\hline \\
& & \footnotesize \bold{Positive} & \footnotesize \bold{Negative} \\
\\
\hline \text{Predicted} & & \colorbox{#AAFFAA}{TP} & \colorbox{#FFAAAA}{FP} & \text{PPV} & \text{FDR} \\
\text{Simulated} & \footnotesize \bold{Positive} & \text{(True Positive)} & \text{(False Positive)} & \dfrac{\colorbox{#AAFFAA}{TP}}{\colorbox{#AAFFAA}{TP}+\colorbox{#FFAAAA}{FP}} \fcolorbox{pink}{white}{$ \bold{Precision}$} & \dfrac{\colorbox{#FFAAAA}{FP}}{\colorbox{#AAFFAA}{TP}+\colorbox{#FFAAAA}{FP}} \\
\text{Rapid test} & & & \fcolorbox{#AAAAFF}{yellow}{Type I error} & \footnotesize \text{(Positive Predictive Value)} & \footnotesize \text{(False Discovery Rate)} \\
\\
\hline \text{Predicted} & & \colorbox{#CC7777}{FN} & \colorbox{#55AA55}{TN} & \text{FOR} & \text{NPV} \\
\text{Simulated} & \footnotesize \bold{Negative} & \text{(False Negative)} & \text{(True Negative)} & \dfrac{\colorbox{#CC7777}{FN}}{\colorbox{#CC7777}{FN}+\colorbox{#55AA55}{TN}} & \dfrac{\colorbox{#55AA55}{TN}}{\colorbox{#CC7777}{FN}+\colorbox{#55AA55}{TN}} \\
\text{Rapid test} & & \fcolorbox{#AAAAFF}{yellow}{Type II error} & & \footnotesize \text{(False Omission Rate)} & \footnotesize \text{(Negative Predictive Value)} \\
\\
\hline & & \text{True positive rate TPR} & \text{False positive rate FPR} \\
& & \fcolorbox{pink}{white}{$ \bold{Sensitivity}$} \space \fcolorbox{pink}{white}{$ \bold{Recall}$} \\
& & \dfrac{\colorbox{#AAFFAA}{TP}}{\colorbox{#AAFFAA}{TP} + \colorbox{#CC7777}{FN}} & \dfrac{\colorbox{#FFAAAA}{FP}}{\colorbox{#FFAAAA}{FP} + \colorbox{#55AA55}{TN}} \\
\\
\hline & & \text{False negative rate FNR} & \text{True negative rate TNR} \\
& & & \fcolorbox{pink}{white}{$ \bold{Specificity}$} & & \fcolorbox{pink}{white}{$ \bold{Accuracy}$} \\
& & \dfrac{\colorbox{#CC7777}{FN}}{\colorbox{#AAFFAA}{TP} + \colorbox{#CC7777}{FN}} & \dfrac{\colorbox{#55AA55}{TN}}{\colorbox{#FFAAAA}{FP} + \colorbox{#55AA55}{TN}} & & \dfrac{\colorbox{#AAFFAA}{TP}+\colorbox{#55AA55}{TN}}{TP+TN+FP+FN} \\
\\
\hline \\
& & & & \fcolorbox{pink}{white}{$ \textbf{F1 score}$} \\
& & & & 2* \dfrac{Precision*Recall}{Precision+Recall} \\
& & & & \footnotesize \text{(the } \fcolorbox{white}{#FBF3CA}{harmonic mean} \text{ of precision and recall)} \\
\\
& & & & \scriptsize \textcolor{#BBBBBB}{F_\beta = (1+\beta^2)*\dfrac{Precision*Recall}{\beta^2*Precision+Recall}} \\
\\
\hline \end{array}

Image source: MathWorks
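The ratios in the matrix can be computed directly from the four counts. The counts below are hypothetical and `f_beta` is a helper name chosen for this sketch; exact fractions are used so the results are easy to verify by hand.

```python
from fractions import Fraction as F

# Hypothetical counts for a binary classifier on 100 cases.
tp, fp, fn, tn = 40, 10, 20, 30

precision = F(tp, tp + fp)              # PPV
recall = F(tp, tp + fn)                 # TPR, sensitivity
specificity = F(tn, tn + fp)            # TNR
false_positive_rate = F(fp, fp + tn)    # FPR = 1 - specificity
accuracy = F(tp + tn, tp + fp + fn + tn)


def f_beta(precision, recall, beta=1):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1 = f_beta(precision, recall)
print(precision, recall, specificity, accuracy, f1)
```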

Machine Learning

Data Processing

Data standardization

z-score Standardization: $z=\dfrac{x-\mu}{\sigma}$
Min-max Normalization: $x_{scaled} = \dfrac{x-x_{min}}{x_{max} - x_{min}}, \space \footnotesize x_{scaled} \in [0,1]$

Nonlinear transformations

The sigmoid function, the simplest form of the logistic function, squashes values into $(0,1)$: $s(x) = \dfrac {1}{1+e^{-ax}}$, where $a$ adjusts the steepness of the curve
The $\tanh$ function squashes values into $(-1,1)$: $\tanh(ax) = \dfrac {e^{ax} - e^{-ax}}{e^{ax} + e^{-ax}}$, where $a$ adjusts the steepness of the curve
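The four transformations can be sketched in plain Python as follows. The function names and the sample `data` are assumptions made for this example; the z-score here uses the population standard deviation, matching the $\sigma$ in the formula.

```python
import math
from statistics import fmean, pstdev


def z_scores(xs):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sigma = fmean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max(xs):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def sigmoid(x, a=1.0):
    """Logistic curve with steepness a; output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a * x))


def scaled_tanh(x, a=1.0):
    """tanh with steepness a; output lies strictly between -1 and 1."""
    return math.tanh(a * x)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
standardized = z_scores(data)
rescaled = min_max(data)
print(standardized)
print(rescaled)
print(sigmoid(0.0), scaled_tanh(0.0))
```

Both `z_scores` and `min_max` divide by zero when all inputs are equal, so real code would need to guard against constant columns.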

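Dimensionality Reduction

Feature selection

Regression and classification problems each have their own methods; the list below uses scikit-learn as the example

(Regression) f_regression: Univariate linear regression tests returning F-statistic and p-values
(Classification, ANOVA) f_classif: Compute the ANOVA F-value for the provided sample
(Classification, chi-squared) chi2: Compute chi-squared stats between each non-negative feature and class
(Regression, continuous target) mutual_info_regression: Estimate mutual information for a continuous target variable
(Classification, discrete target) mutual_info_classif: Estimate mutual information for a discrete target variable
SequentialFeatureSelector: sequential (stepwise) feature selection; takes longer to run

Feature extraction

Principal Component Analysis PCA: commonly used in unsupervised learning
Linear Discriminant Analysis LDA: used in supervised learning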

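Objective Function, Cost Function, Loss Function

In practice these names have no strict, universally agreed definitions.

Any function that is maximized (a payoff) or minimized (an error) may go by one of these names.

Regression

Mean Squared Error MSE
Mean Absolute Error MAE
Mean Squared Logarithmic Error MSLE
Root Mean Squared Error RMSE (square root of MSE)
Mean Absolute Percentage Error MAPE (absolute errors expressed as a percentage of the true values, then averaged)
Root Mean Squared Logarithmic Error RMSLE (square root of MSLE)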
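The regression losses listed above can be written out in a few lines. This is a sketch with made-up targets and predictions; MSLE is computed on $\log(1 + y)$, which is a common convention (it is the one scikit-learn uses) rather than something these notes specify.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when a true value is 0."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    """Mean squared error of log(1 + y); inputs must be greater than -1."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def rmsle(y_true, y_pred):
    return math.sqrt(msle(y_true, y_pred))


y_true = [1.0, 2.0, 4.0]  # hypothetical targets
y_pred = [1.0, 3.0, 2.0]  # hypothetical predictions

for name, fn in [("MSE", mse), ("MAE", mae), ("RMSE", rmse),
                 ("MAPE", mape), ("MSLE", msle), ("RMSLE", rmsle)]:
    print(f"{name}: {fn(y_true, y_pred):.4f}")
```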

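Gradient Descent

An optimization algorithm that applies to many kinds of models: it repeatedly adjusts the parameters in the direction that lowers the loss function until the loss is (locally) minimized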
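As a minimal sketch of the idea, the code below fits a line $y = wx + b$ by gradient descent on the mean squared error. The data points lie exactly on $y = 2x + 1$ and, like the learning rate and step count, are assumptions chosen so the result is easy to check.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2_000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

w, b = gradient_descent(xs, ys)
print(f"w={w:.6f} b={b:.6f}")
```

A learning rate that is too large makes the updates diverge, and one that is too small makes convergence slow, so in practice it is tuned per problem.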

Supervised Learning Algorithms

Regression

Relatively sensitive to extreme values (outliers) in the data
〖Classification〗Logistic regression

〖Regression〗Linear regression

Support Vector Machine (SVM)

Scales poorly to large datasets, because training cost grows quickly with the number of samples
Mainly used for 〖classification〗

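Naive Bayes Classifier

A 〖classification〗 model based on Bayes’ Theorem
Gaussian
Bernoulli
Multinomial

Decision Tree

A relatively simple and intuitive algorithm that is easy to interpret and visualize

Random Forest

An ensemble of many decision trees
Predicts well, but the results are hard to interpret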

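$k$-Nearest Neighbors (KNN)

A relatively simple algorithm that is easy to interpret and makes outliers easy to spot
KNN performs worse when class sizes are unbalanced or when the data has many dimensions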
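The core of KNN fits in a few lines: find the $k$ training points closest to the query and take a majority vote. The sketch below uses Euclidean distance and a tiny made-up dataset; `knn_predict` is a name invented for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster training set.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.1, 5.0)))
```

Because the vote depends on raw distances, features on very different scales should usually be standardized first.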

Artificial Neural Network

Very complex; needs larger datasets and more computing resources
Often achieves the best results, but is hard to interpret

Common Biases

Simpson’s paradox

A trend that appears within each group of the data disappears or reverses once the groups are combined.
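A numeric example shows how the reversal happens. The counts below are the ones commonly quoted for the kidney-stone treatment comparison; treat them as illustrative rather than as data from these notes. Treatment A has the higher success rate within each stone size, yet B looks better once the groups are pooled, because A was given mostly to the harder (large-stone) cases.

```python
from fractions import Fraction as F

# (successes, patients) for each treatment and stone size.
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def group_rate(treatment, group):
    successes, patients = outcomes[treatment][group]
    return F(successes, patients)


def overall_rate(treatment):
    successes = sum(s for s, _ in outcomes[treatment].values())
    patients = sum(n for _, n in outcomes[treatment].values())
    return F(successes, patients)


for group in ("small", "large"):
    print(group, float(group_rate("A", group)), float(group_rate("B", group)))
print("overall", float(overall_rate("A")), float(overall_rate("B")))
```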

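Cherry picking

Citing individual cases and exceptions that favor a particular position or conclusion while ignoring the overall data that does not.

Confirmation bias

The tendency to gather information that fits one's own beliefs and values and so reinforce what one already believes; common behaviors include "defining terms vaguely, reinterpreting, and revising memories".

p-hacking

Misusing data-analysis methods to obtain the desired statistically significant (Statistical Significance) result.

Even when an algorithm performs well at clustering, classification, analysis, and prediction, some latent problems remain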

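Bradford Hill Criteria

Criteria used in epidemiology to judge causal relationships (Causality, Causation, Cause and effect); also called Hill’s criteria.

1. Strength

How strong the association (Association) between cause and effect is

2. Consistency

Different researchers observe the same result in different places with different samples; reproducibility (Reproducibility)

3. Specificity

If a disease occurs in a specific population at a specific site with no other likely explanation, a causal relationship is more probable

4. Temporality

The cause must occur before the effect

5. Dose-response relationship

Greater exposure generally leads to a higher incidence of the effect; also called biological gradient (Biological gradient)

6. Plausibility

A theoretical mechanism can explain the link between cause and effect (though knowledge of the mechanism may be limited by current science)

7. Coherence

Epidemiological findings agree with earlier laboratory findings

8. Experiment

Supported by rigorous experimental evidence

9. Analogy

Analogy with other, similar associations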

Uncertainty (in Metrology)

Reference

A Type A evaluation of standard uncertainty may be based on any valid statistical method for treating data. Examples are calculating the standard deviation of the mean of a series of independent observations; using the method of least squares to fit a curve to data in order to estimate the parameters of the curve and their standard deviations; and carrying out an analysis of variance (ANOVA) in order to identify and quantify random effects in certain kinds of measurements.
A Type B evaluation of standard uncertainty is usually based on scientific judgment using all of the relevant information available, which may include

previous measurement data,
experience with, or general knowledge of, the behavior and property of relevant materials and instruments
manufacturer's specifications
data provided in calibration and other reports
uncertainties assigned to reference data taken from handbooks

The natural constant e ≈ 2.71828 and the natural logarithm ln

$e$ is an irrational and transcendental number
$e = 2.71828 \space \color{gray}{18284} \color{black} \dots = \underset {n \to \infty } \lim \bigg ( 1 + \dfrac{1}{n} \bigg)^n = \sum_{n=0}^{\infin} \dfrac{1}{n!} = 1 + \dfrac{1}{1} + \dfrac{1}{1*2} + \dfrac{1}{1*2*3} + \dots$
$e = \underset {h \to 0} \lim \bigg(1 +h \bigg)^{\dfrac{1}{h}}$, so for small $h$ we have $e^h \approx 1 + h$ (raise both sides to the power $h$), which gives $\underset{h \to 0} \lim \colorbox{lightgray}{$\dfrac{e^h - 1}{h}$} = 1$

$f(x) = e^x, \quad \colorbox{#FFAAAA}{$f'(x)$} = \underset{h \to 0} \lim \dfrac{f(x+h) - f(x)}{h} = \underset{h \to 0} \lim \space \dfrac{e^xe^h - e^x}{h} = \underset{h \to 0} \lim \space e^x \colorbox{lightgray}{$(\dfrac{e^h - 1}{h})$} = \colorbox{#FFAAAA}{$e^x$}$

$\log_e x = \operatorname{ln}x$
$\operatorname{ln}a = \int_1^a \dfrac{1}{x} \space dx$
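These identities can be checked numerically. The sketch below evaluates the series, the compound-interest limit, the difference quotient at 0, and the integral definition of $\ln a$; the values of `n`, `h`, `a`, and the step count are arbitrary choices, and `midpoint_integral` is a helper written for this example.

```python
import math

# Series definition: the sum of 1/n! converges very quickly.
series = sum(1 / math.factorial(n) for n in range(20))

# Limit definition: (1 + 1/n)^n approaches e slowly as n grows.
n = 1_000_000
compound = (1 + 1 / n) ** n

# (e^h - 1) / h tends to 1 as h tends to 0, which is why (e^x)' = e^x.
h = 1e-6
slope_at_zero = (math.exp(h) - 1) / h


def midpoint_integral(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# ln(a) as the area under 1/x between 1 and a.
a = 5.0
log_area = midpoint_integral(lambda x: 1 / x, 1.0, a)

print(series, compound, slope_at_zero, log_area, math.log(a))
```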

Data Science Tools

Bibliography

Machine learning

《Bayesian Statistics for Beginners: A Step-by-Step Approach》

《AI 必須！從做中學貝氏統計》

黃志勝, 《機器學習的統計基礎》
西內啟, 《機器學習的數學基礎：AI、深度學習打底必讀》

Statistics

涌井良幸 and 涌井貞美, 《誰都看得懂的統計學超圖解》
《統計與生活》, 2nd edition, 2015-08, 國立臺灣大學出版中心

General-education textbook

《臥底經濟學家的10堂數據偵探課》

Uses real historical cases to show how to spot each statistical trap, read data with the right emotions and attitude, and run statistical experiments correctly

《寫給大家的統計學》
《簡單到不可思議的貝氏統計學》

《Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps》

《機器學習設計模式》

Population	Sample	Unbiased estimate
$n$	$n$	$n-1$
$\mu$	$\overline x$	$\overline x=\hat \mu$
$\sigma^2$	$s^2$	$\hat \sigma^2$
$\sigma$	$s$	$\hat \sigma$
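The difference between dividing by $n$ and by $n-1$ shows up directly in Python's `statistics` module, where `pvariance` and `pstdev` divide by $n$ while `variance` and `stdev` divide by $n-1$. The data below are made up for illustration.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean 5

center = mean(data)
population_variance = pvariance(data)  # sum of squared deviations / n
unbiased_variance = variance(data)     # sum of squared deviations / (n - 1)

population_sd = pstdev(data)
sample_sd = stdev(data)

print(center, population_variance, unbiased_variance)
print(population_sd, sample_sd)
```

Treating `data` as the whole population calls for the first pair; treating it as a sample used to estimate the population variance calls for the second.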
