Data Visualization with Python

許sir

[Recommendation]. Upgrade seaborn to version 0.9 first

  • pip3 install seaborn==0.9.0
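To confirm which version actually ended up installed, a check along the following lines may help. It is only a sketch using the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages used in these notes; a missing package prints None
for name in ("seaborn", "matplotlib", "pandas", "plotly"):
    print(name, installed_version(name))
```

If seaborn reports something other than 0.9.0, some calls in these notes may behave differently.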

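[Overview]. Visualization with seaborn

  • seaborn is a Python library built on top of matplotlib for producing rich, attractive statistical graphics.
  • Seaborn aims to make visualization a central part of exploring and understanding data, helping people look more closely at the dataset they are studying.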

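[Preliminaries]. Displaying Chinese text in matplotlib and seaborn

Python has three main packages for data visualization:

  • 'Matplotlib': the oldest package, with a complete set of commands; there is almost no chart it cannot draw, but it is also the most complex
  • 'Seaborn': a higher-level visualization API implemented on top of Matplotlib, which makes plotting more convenient and easier.
  • 'Plotly': attractive charts of many kinds, with support for interactive interfaces. Plotly is not bundled with Anaconda, so install it first with pip install plotly

Font and encoding issues

  • In matplotlib and Seaborn, Chinese text needs special configuration; otherwise Chinese characters will not be displayed in the figure
  • Use the settings below; for details, see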

https://medium.com/marketingdatascience/%E8%A7%A3%E6%B1%BApython-3-matplotlib%E8%88%87seaborn%E8%A6%96%E8%A6%BA%E5%8C%96%E5%A5%97%E4%BB%B6%E4%B8%AD%E6%96%87%E9%A1%AF%E7%A4%BA%E5%95%8F%E9%A1%8C-f7b3773a889b

  • Recommended:
    • Font available on Mac: SimHei
    • Font available on Windows: Microsoft JhengHei
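Before changing any settings, it can be useful to check which of these fonts matplotlib can already see. The sketch below reads the font list that matplotlib keeps in `font_manager.fontManager.ttflist`; the helper name `available_fonts` is hypothetical, and DejaVu Sans is included only because it ships with matplotlib.

```python
from matplotlib import font_manager


def available_fonts(candidates):
    """Return the candidate font names that matplotlib currently knows about."""
    installed = {font.name for font in font_manager.fontManager.ttflist}
    return [name for name in candidates if name in installed]


candidates = ["Microsoft JhengHei", "SimHei", "DejaVu Sans"]
found = available_fonts(candidates)
print(found)
```

If neither Chinese font appears in the result, the font file still has to be installed or copied as described in the steps that follow.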

STEP1: Use the following code to find the current font path and confirm which font is being picked up

In [1]:
from matplotlib.font_manager import findfont, FontProperties  
findfont(FontProperties(family=FontProperties().get_family())) 
Out[1]:
'C:\\Users\\user\\Anaconda3\\lib\\site-packages\\matplotlib\\mpl-data\\fonts\\ttf\\DejaVuSans.ttf'

STEP2: Use the following code to find the path of the configuration file

In [2]:
import matplotlib 
matplotlib.matplotlib_fname()
Out[2]:
'C:\\Users\\user\\Anaconda3\\lib\\site-packages\\matplotlib\\mpl-data\\matplotlibrc'
  • 1. Locate the file C:\Users\user\Anaconda3\lib\site-packages\matplotlib\mpl-data\matplotlibrc and open it with Notepad
  • 2. Remove the comment marker (#) in front of font.family and font.serif, then add Microsoft JhengHei to the font.serif list

STEP3: Delete the .matplotlib cache folder

  • Next, delete the .matplotlib folder. It holds matplotlib's old cache files, so deleting it prevents the updated font setting from being overridden by stale cache entries.
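The location of the cache folder varies between machines, so rather than guessing, it can be looked up with `matplotlib.get_cachedir()`. This sketch only prints the folder and its contents; it deliberately deletes nothing.

```python
import os

import matplotlib

# Ask matplotlib where it stores its cache (font list and similar files)
cache_dir = matplotlib.get_cachedir()
print("matplotlib cache folder:", cache_dir)

# List the cached files so you can see what would be removed
if os.path.isdir(cache_dir):
    for name in os.listdir(cache_dir):
        print("  ", name)
```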

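STEP4: Put the font file in matplotlib's font folder

  • Download the msj font file (download link, provided by another user), name it msj, and save it in the following folder: C:\Users\<your username>\Anaconda3\Lib\site-packages\matplotlib\mpl-data\fonts\ttf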
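As a possible alternative to copying the file by hand, newer matplotlib releases (3.2 and later, which is an assumption about your environment and not what these notes were written for) can register a font file at runtime with `fontManager.addfont`. In the sketch below the bundled DejaVu Sans file stands in for the downloaded font so that the code runs anywhere; in practice `font_path` would point at your own font file.

```python
from matplotlib import font_manager

# Assumption: replace this with the path of the downloaded font file.
# The DejaVu Sans file bundled with matplotlib is used here as a stand-in.
font_path = font_manager.findfont("DejaVu Sans")

# Register the file with matplotlib's font manager (requires matplotlib >= 3.2)
font_manager.fontManager.addfont(font_path)

# Read back the family name to use later in plt.rcParams['font.sans-serif']
font_name = font_manager.FontProperties(fname=font_path).get_name()
print(font_name)
```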

STEP5: Specify the font with rcParams

In [7]:
# Quick test
%matplotlib inline
# Render subsequent plots inline in the notebook

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns  # add the seaborn package

# Use a font that contains Chinese glyphs and keep the minus sign rendering correctly
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']
plt.rcParams['axes.unicode_minus'] = False
plt.plot((1, 2, 3), (4, 3, -1))
# The labels are deliberately left in Chinese to test the font setting
plt.title("聲量圖")      # "Buzz volume chart"
plt.ylabel("文章數量")   # "Number of articles"
plt.xlabel("品牌名稱")   # "Brand name"
plt.show()

[Standard setup]. If you use Python for scientific computing or numerical analysis, you will typically import these three packages

In [5]:
%matplotlib inline
# Render subsequent plots inline in the notebook

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sns  # add the seaborn package

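[Part 1]. Basic descriptive statistics

The tips dataset. It was collected by a restaurant waiter and records the tips he received. It contains seven variables:

  • total_bill: total bill amount
  • tip: tip amount
  • sex: sex of the person paying
  • smoker: whether the party included smokers
  • day: day of the week
  • time: meal time (Lunch or Dinner)
  • size: number of people in the party.

[Main analysis goal]. Through data analysis and modeling, help the waiter predict whether customers dining at the restaurant will leave a tip.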

In [9]:
tips = pd.read_csv('tips.csv')
In [10]:
tips.head()  # show the first five rows
Out[10]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
In [11]:
tips.tail()  # show the last five rows
Out[11]:
     total_bill   tip     sex smoker   day    time  size
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2
In [12]:
tips.describe()  # summary statistics; only the quantitative columns are listed
Out[12]:
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
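Since `describe()` skips the non-numeric columns by default, the categorical ones need a separate look. The sketch below rebuilds the five rows shown by `tips.head()` (so it runs without tips.csv) and summarizes the text columns with `describe(exclude=["number"])` and `value_counts()`; the variable name `sample` is only for illustration, and five rows say nothing about the full dataset.

```python
import pandas as pd

# The five rows printed by tips.head(), retyped so the example is self-contained
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2, 4],
})

# count / unique / top / freq for every non-numeric column
categorical_summary = sample.describe(exclude=["number"])
print(categorical_summary)

# Frequency of each category in a single column
sex_counts = sample["sex"].value_counts()
print(sex_counts)
```

On the real data the same two calls would be applied to `tips` instead of `sample`.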

[Part 2]. Visualization for EDA

  • Seaborn is a high-level plotting package built on matplotlib that makes it easier to create charts; it can be thought of as a supplement to matplotlib
In [13]:
sns.set(style="ticks")
In [14]:
sns.scatterplot(x="total_bill", y="tip", 
                hue="time",
                data=tips)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67aadf98>

2-1. Frequency distribution plot

In [15]:
sns.distplot(tips['total_bill'])
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67aadf60>
In [16]:
sns.distplot(tips['total_bill'], kde=False)  # kde=False hides the kernel density estimate (KDE) curve
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67bfd320>

The plot above shows that customers' total bills fall mainly in the 5-35 range (a right-skewed distribution)
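The right skew can also be read off the numbers already printed by `tips.describe()`: the mean sits above the median, and the upper tail is much longer than the lower one. The sketch below just retypes those summary values for total_bill; the dictionary and variable names are made up for illustration.

```python
# Summary values for total_bill, copied from the tips.describe() output
stats = {
    "min": 3.07,
    "q1": 13.3475,
    "median": 17.795,
    "mean": 19.785943,
    "q3": 24.1275,
    "max": 50.81,
}

# In a right-skewed distribution the mean is pulled above the median
mean_above_median = stats["mean"] > stats["median"]

# ... and the distance from Q3 to the maximum exceeds the distance from the minimum to Q1
lower_tail = stats["q1"] - stats["min"]
upper_tail = stats["max"] - stats["q3"]

print(mean_above_median, round(lower_tail, 4), round(upper_tail, 4))
```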

2-2. Count plot

In [17]:
sns.countplot(x = 'smoker',  data = tips)
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67ca2d30>

Among the customers dining at the restaurant, non-smokers outnumber smokers
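The bar heights in a count plot are simply category frequencies, which pandas reports with `value_counts()`. As a sketch, the ten rows shown earlier by `tips.head()` and `tips.tail()` are retyped here so the code runs without tips.csv; ten rows are only illustrative and do not reproduce the full-data counts.

```python
import pandas as pd

# smoker column of the ten rows shown by tips.head() and tips.tail()
sample = pd.DataFrame({
    "smoker": ["No", "No", "No", "No", "No",      # rows 0-4
               "No", "Yes", "Yes", "No", "No"],   # rows 239-243
})

# Same numbers that sns.countplot(x='smoker', data=sample) would draw as bars
smoker_counts = sample["smoker"].value_counts()
print(smoker_counts)
```

On the full dataset, `tips['smoker'].value_counts()` gives the exact counts behind the plot.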

In [18]:
sns.barplot(x="day", y="total_bill", data=tips)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67d02d30>
In [19]:
sns.countplot(x = 'time',  data = tips)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67d61dd8>

Customers come to the restaurant mostly for dinner; lunch visits are fewer

In [20]:
sns.countplot(x = 'size',  data = tips)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67dbd668>

Parties of two are by far the most common; every other party size appears much less often

In [21]:
sns.countplot(x = 'day',  data = tips)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67e0af98>

In terms of number of visits, Saturday is the highest

2-3. Bar plot

In [22]:
sns.barplot(x="day", y="total_bill", data=tips)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd67e6cb38>

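Customers dine mainly on Sunday, Saturday and Thursday, but the average bill (the bar height is the mean by default) is highest on Sunday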
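By default `sns.barplot` draws the mean of `y` within each category of `x`, which corresponds to a pandas `groupby(...).mean()`. The sketch below uses the ten rows printed earlier by `tips.head()` and `tips.tail()`, so its numbers differ from the full-data plot; it only shows where the bar heights come from.

```python
import pandas as pd

# day and total_bill of the ten rows shown by tips.head() and tips.tail()
sample = pd.DataFrame({
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun", "Sat", "Sat", "Sat", "Sat", "Thur"],
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59,
                   29.03, 27.18, 22.67, 17.82, 18.78],
})

# One mean per day: the value a bar plot would use as the bar height
mean_bill_by_day = sample.groupby("day")["total_bill"].mean()
print(mean_bill_by_day)
```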

In [24]:
sns.barplot(x="day", y="total_bill", hue="sex", data=tips)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x2697ac36198>

Looking at the bills, those paid by men are higher on average at this restaurant (men also pay the bill somewhat more often than women)

In [25]:
sns.barplot(x="time", y="tip", data=tips,
            order=["Dinner", "Lunch"])
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x2697acc2908>
In [26]:
from numpy import median
sns.barplot(x="day", y="tip", data=tips, estimator=median)  # use each day's median tip as the estimate
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x2697ad29780>
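Switching the estimator from mean to median matters when a few large values are present. The sketch below compares the two on the ten rows shown earlier by `tips.head()` and `tips.tail()` (retyped here, and not representative of the full data): the single 5.92 tip on Saturday lifts the mean well above the median.

```python
import pandas as pd

# day and tip of the ten rows shown by tips.head() and tips.tail()
sample = pd.DataFrame({
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun", "Sat", "Sat", "Sat", "Sat", "Thur"],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00, 1.75, 3.00],
})

# Mean and median tip per day, side by side
tip_summary = sample.groupby("day")["tip"].agg(["mean", "median"])
print(tip_summary)
```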

[Part 3]. More advanced syntax

The kind parameter of sns.jointplot, used in 3-1 below:

kind : { “scatter” | “reg” | “resid” | “kde” | “hex” }, optional

Kind of plot to draw.

3-1. Joint-Plot

In [23]:
sns.jointplot(x = 'total_bill', y = 'tip', data = tips)
Out[23]:
<seaborn.axisgrid.JointGrid at 0x1cd67e197b8>
  • Most bills fall between 10 and 30 dollars,
  • and the corresponding tips fall between 1 and 5 dollars
In [24]:
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')  # hexbin view: the darker the color, the higher the frequency
Out[24]:
<seaborn.axisgrid.JointGrid at 0x1cd67f085f8>
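  • Total bills are concentrated between 10 and 20 dollars, and tips between 1 and 3.5 dollars
  • On its own, however, this does not prove that "the higher the bill, the bigger the tip"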
In [25]:
sns.jointplot(x = 'total_bill', y = 'tip', data = tips ,kind = 'kde')
Out[25]:
<seaborn.axisgrid.JointGrid at 0x1cd67feba58>
In [26]:
sns.jointplot(x = 'total_bill', y = 'tip', data = tips ,kind = 'reg')
Out[26]:
<seaborn.axisgrid.JointGrid at 0x1cd6813cf60>

Fitting a simple regression line shows that the tip amount increases as the total bill increases
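The direction of that line can be checked numerically with an ordinary least-squares fit. This sketch uses `numpy.polyfit` and `numpy.corrcoef` on the ten rows shown earlier by `tips.head()` and `tips.tail()`; it is an illustration on a tiny sample, not the fit that `kind='reg'` draws from all 244 rows.

```python
import numpy as np

# total_bill and tip of the ten rows shown by tips.head() and tips.tail()
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59,
                       29.03, 27.18, 22.67, 17.82, 18.78])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61,
                5.92, 2.00, 2.00, 1.75, 3.00])

# Degree-1 polynomial fit: tip is approximated by slope * total_bill + intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Pearson correlation between the two variables
r = np.corrcoef(total_bill, tip)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

A positive slope means the fitted tip rises with the bill; the correlation indicates how tightly the points follow the line.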

3-2. Pair-plot

It shows scatter plots for every pair of variables

In [27]:
sns.pairplot(tips)
Out[27]:
<seaborn.axisgrid.PairGrid at 0x1cd68297ba8>

The grid covers the three numeric variables: total bill, tip amount, and party size
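Only three of the seven columns appear because the grid is built from the numeric columns. Which columns those are can be listed with `select_dtypes`; the sketch below uses the five rows from `tips.head()`, retyped so it runs without tips.csv.

```python
import pandas as pd

# The five rows printed by tips.head()
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2, 4],
})

# Keep only the numeric columns: the ones a pair plot can scatter against each other
numeric_columns = list(sample.select_dtypes(include="number").columns)
print(numeric_columns)
```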

In [28]:
sns.pairplot(tips, hue='sex', markers=["o", "s"])  # "o" is a circle, "s" is a square
# Look at how sex relates to the other variables
# hue='sex' colors the points by sex
Out[28]:
<seaborn.axisgrid.PairGrid at 0x1cd696a82b0>
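  • Two different colors distinguish the sexes
  • This restaurant has more male customers, but whether sex affects the size of the tip cannot be determined from this plot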

3-3. Box-plot

In [29]:
sns.boxplot(x = 'day', y= 'total_bill', data = tips)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd69c7ed30>

Most bills are paid on Saturday and Sunday; the box plot adds the spread of bill amounts for each day
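The points drawn beyond the whiskers of a box plot are values lying more than 1.5 times the interquartile range (IQR) outside the box, which is the default rule in matplotlib and seaborn. As a sketch, the calculation below uses the overall quartiles of total_bill from the `tips.describe()` output; the plot itself applies the same rule separately to each day.

```python
# Quartiles and maximum of total_bill, copied from the tips.describe() output
q1 = 13.3475
q3 = 24.1275
max_bill = 50.81

# Interquartile range: the height of the box
iqr = q3 - q1

# Default whisker limits: 1.5 * IQR beyond the box on either side
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond a fence are drawn as individual outlier points
has_high_outliers = max_bill > upper_fence
print(round(iqr, 4), round(lower_fence, 4), round(upper_fence, 4), has_high_outliers)
```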

In [30]:
sns.boxplot(x='day', y='total_bill', data=tips, hue='sex')  # hue='sex' splits each box by sex
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd6a044a20>
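  • The reading given here is that Saturday is the only day on which women pay the bill more often than men. Note, however, that a box plot shows the distribution of bill amounts, not the number of bills, so this should be checked against the counts.
  • The specific reason is unknown and would need qualitative research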
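How often each sex pays on each day is a question about counts, which a cross-tabulation answers directly. The sketch below applies `pd.crosstab` to the ten rows shown earlier by `tips.head()` and `tips.tail()`; those rows are far too few to settle the question, so on real data the same call would be run on `tips`.

```python
import pandas as pd

# day and sex of the ten rows shown by tips.head() and tips.tail()
sample = pd.DataFrame({
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun", "Sat", "Sat", "Sat", "Sat", "Thur"],
    "sex": ["Female", "Male", "Male", "Male", "Female",
            "Male", "Female", "Male", "Male", "Female"],
})

# Rows are days, columns are sexes, cells are the number of bills
bills_by_day_and_sex = pd.crosstab(sample["day"], sample["sex"])
print(bills_by_day_and_sex)
```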

3-4. Cat-Plot

In [31]:
sns.set(style="ticks", color_codes=True)
In [32]:
help(sns.catplot)
Help on function catplot in module seaborn.categorical:

catplot(x=None, y=None, hue=None, data=None, row=None, col=None, col_wrap=None, estimator=<function mean at 0x000001CD630622F0>, ci=95, n_boot=1000, units=None, order=None, hue_order=None, row_order=None, col_order=None, kind='strip', height=5, aspect=1, orient=None, color=None, palette=None, legend=True, legend_out=True, sharex=True, sharey=True, margin_titles=False, facet_kws=None, **kwargs)
    Figure-level interface for drawing categorical plots onto a FacetGrid.
    
    This function provides access to several axes-level functions that
    show the relationship between a numerical and one or more categorical
    variables using one of several visual representations. The ``kind``
    parameter selects the underlying axes-level function to use:
    
    Categorical scatterplots:
    
    - :func:`stripplot` (with ``kind="strip"``; the default)
    - :func:`swarmplot` (with ``kind="swarm"``)
    
    Categorical distribution plots:
    
    - :func:`boxplot` (with ``kind="box"``)
    - :func:`violinplot` (with ``kind="violin"``)
    - :func:`boxenplot` (with ``kind="boxen"``)
    
    Categorical estimate plots:
    
    - :func:`pointplot` (with ``kind="point"``)
    - :func:`barplot` (with ``kind="bar"``)
    - :func:`countplot` (with ``kind="count"``)
    
    Extra keyword arguments are passed to the underlying function, so you
    should refer to the documentation for each to see kind-specific options.
    
    Note that unlike when using the axes-level functions directly, data must be
    passed in a long-form DataFrame with variables specified by passing strings
    to ``x``, ``y``, ``hue``, etc.
    
    As in the case with the underlying plot functions, if variables have a
    ``categorical`` data type, the the levels of the categorical variables, and
    their order will be inferred from the objects. Otherwise you may have to
    use alter the dataframe sorting or use the function parameters (``orient``,
    ``order``, ``hue_order``, etc.) to set up the plot correctly.
    
    This function always treats one of the variables as categorical and
    draws data at ordinal positions (0, 1, ... n) on the relevant axis, even
    when the data has a numeric or date type.
    
    See the :ref:`tutorial <categorical_tutorial>` for more information.    
    
    After plotting, the :class:`FacetGrid` with the plot is returned and can
    be used directly to tweak supporting plot details or add other layers.
    
    Parameters
    ----------
    x, y, hue : names of variables in ``data``
        Inputs for plotting long-form data. See examples for interpretation.        
    data : DataFrame
        Long-form (tidy) dataset for plotting. Each column should correspond
        to a variable, and each row should correspond to an observation.    
    row, col : names of variables in ``data``, optional
        Categorical variables that will determine the faceting of the grid.
    col_wrap : int, optional
        "Wrap" the column variable at this width, so that the column facets
        span multiple rows. Incompatible with a ``row`` facet.    
    estimator : callable that maps vector -> scalar, optional
        Statistical function to estimate within each categorical bin.
    ci : float or "sd" or None, optional
        Size of confidence intervals to draw around estimated values.  If
        "sd", skip bootstrapping and draw the standard deviation of the
        observations. If ``None``, no bootstrapping will be performed, and
        error bars will not be drawn.
    n_boot : int, optional
        Number of bootstrap iterations to use when computing confidence
        intervals.
    units : name of variable in ``data`` or vector data, optional
        Identifier of sampling units, which will be used to perform a
        multilevel bootstrap and account for repeated measures design.    
    order, hue_order : lists of strings, optional
        Order to plot the categorical levels in, otherwise the levels are
        inferred from the data objects.        
    row_order, col_order : lists of strings, optional
        Order to organize the rows and/or columns of the grid in, otherwise the
        orders are inferred from the data objects.
    kind : string, optional
        The kind of plot to draw (corresponds to the name of a categorical
        plotting function. Options are: "point", "bar", "strip", "swarm",
        "box", "violin", or "boxen".
    height : scalar, optional
        Height (in inches) of each facet. See also: ``aspect``.    
    aspect : scalar, optional
        Aspect ratio of each facet, so that ``aspect * height`` gives the width
        of each facet in inches.    
    orient : "v" | "h", optional
        Orientation of the plot (vertical or horizontal). This is usually
        inferred from the dtype of the input variables, but can be used to
        specify when the "categorical" variable is a numeric or when plotting
        wide-form data.    
    color : matplotlib color, optional
        Color for all of the elements, or seed for a gradient palette.    
    palette : palette name, list, or dict, optional
        Colors to use for the different levels of the ``hue`` variable. Should
        be something that can be interpreted by :func:`color_palette`, or a
        dictionary mapping hue levels to matplotlib colors.    
    legend : bool, optional
        If ``True`` and there is a ``hue`` variable, draw a legend on the plot.
    legend_out : bool, optional
        If ``True``, the figure size will be extended, and the legend will be
        drawn outside the plot on the center right.    
    share{x,y} : bool, 'col', or 'row' optional
        If true, the facets will share y axes across columns and/or x axes
        across rows.    
    margin_titles : bool, optional
        If ``True``, the titles for the row variable are drawn to the right of
        the last column. This option is experimental and may not work in all
        cases.    
    facet_kws : dict, optional
        Dictionary of other keyword arguments to pass to :class:`FacetGrid`.
    kwargs : key, value pairings
        Other keyword arguments are passed through to the underlying plotting
        function.
    
    Returns
    -------
    g : :class:`FacetGrid`
        Returns the :class:`FacetGrid` object with the plot on it for further
        tweaking.
    
    Examples
    --------
    
    Draw a single facet to use the :class:`FacetGrid` legend placement:
    
    .. plot::
        :context: close-figs
    
        >>> import seaborn as sns
        >>> sns.set(style="ticks")
        >>> exercise = sns.load_dataset("exercise")
        >>> g = sns.catplot(x="time", y="pulse", hue="kind", data=exercise)
    
    Use a different plot kind to visualize the same data:
    
    .. plot::
        :context: close-figs
    
        >>> g = sns.catplot(x="time", y="pulse", hue="kind",
        ...                data=exercise, kind="violin")
    
    Facet along the columns to show a third categorical variable:
    
    .. plot::
        :context: close-figs
    
        >>> g = sns.catplot(x="time", y="pulse", hue="kind",
        ...                 col="diet", data=exercise)
    
    Use a different height and aspect ratio for the facets:
    
    .. plot::
        :context: close-figs
    
        >>> g = sns.catplot(x="time", y="pulse", hue="kind",
        ...                 col="diet", data=exercise,
        ...                 height=5, aspect=.8)
    
    Make many column facets and wrap them into the rows of the grid:
    
    .. plot::
        :context: close-figs
    
        >>> titanic = sns.load_dataset("titanic")
        >>> g = sns.catplot("alive", col="deck", col_wrap=4,
        ...                 data=titanic[titanic.deck.notnull()],
        ...                 kind="count", height=2.5, aspect=.8)
    
    Plot horizontally and pass other keyword arguments to the plot function:
    
    .. plot::
        :context: close-figs
    
        >>> g = sns.catplot(x="age", y="embark_town",
        ...                 hue="sex", row="class",
        ...                 data=titanic[titanic.embark_town.notnull()],
        ...                 orient="h", height=2, aspect=3, palette="Set3",
        ...                 kind="violin", dodge=True, cut=0, bw=.2)
    
    Use methods on the returned :class:`FacetGrid` to tweak the presentation:
    
    .. plot::
        :context: close-figs
    
        >>> g = sns.catplot(x="who", y="survived", col="class",
        ...                 data=titanic, saturation=.5,
        ...                 kind="bar", ci=None, aspect=.6)
        >>> (g.set_axis_labels("", "Survival Rate")
        ...   .set_xticklabels(["Men", "Women", "Children"])
        ...   .set_titles("{col_name} {col_var}")
        ...   .set(ylim=(0, 1))
        ...   .despine(left=True))  #doctest: +ELLIPSIS
        <seaborn.axisgrid.FacetGrid object at 0x...>

In [33]:
sns.catplot(x="sex", y="total_bill",
            hue="smoker", col="time",
            data=tips, kind="bar",
            height=4, aspect=.7)
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x1cd6a060860>
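Each bar in this grid is the mean total_bill for one combination of time, smoker and sex. The equivalent table can be produced with a multi-key `groupby`. The rows below are hypothetical values invented for the sketch (they are not taken from tips.csv), chosen so that both Lunch and Dinner appear.

```python
import pandas as pd

# Hypothetical rows, made up for illustration only
example = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "Yes"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [20.0, 30.0, 10.0, 12.0, 18.0, 22.0],
})

# One mean per (time, smoker, sex) combination: col= facet, hue= color, x= position
mean_bill = example.groupby(["time", "smoker", "sex"])["total_bill"].mean()
print(mean_bill)
```

Combinations with no rows simply do not appear in the result, just as the corresponding bar would be missing from the plot.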