Documentation of the sequencefiles_group_by_misfit command

Last Update: 2024/5/15

◆Purpose

Read multiple time series data files and group them based on waveform misfits.

Correlation coefficients are frequently used as the similarity measure when grouping waveforms. However, they are not an appropriate measure for some waveforms. For example, when comparing waveforms that increase (or decrease) monotonically rather than oscillate, such as crustal deformation records, the correlation coefficient tends to be large merely because the waveforms share the property of being monotonic, even if the shapes of the time functions differ. As a concrete example, consider two time functions $f(t)=t^2$ and $g(t)=\sqrt{t}$ defined on $t\in [0,1]$. Their shapes differ markedly: the former is convex (an acceleration type), while the latter is concave (a deceleration type). Nevertheless, the correlation coefficient between them exceeds 0.9 (see the documentation of sequencefiles_group_by_correlation). To judge such waveforms as different, a measure of similarity other than the correlation coefficient is needed.

This program quantifies the similarity between two time functions $f(t)$ and $g(t)$ by the misfit \[\begin{equation} M\equiv \sqrt{\frac{\int [f(t)-g(t)]^2 dt} {(1/2)\int [f(t)^2+g(t)^2] dt}} \label{eq.misfit} \end{equation}\] This quantity is analogous to the residual used in waveform inversion analyses; smaller values of the misfit correspond to higher similarity. The misfit is 0 if $f(t)=g(t)$, $\sqrt{2}$ if $f(t)\neq g(t)=0$, and 2 if $f(t)=-g(t)$. If $f(t)$ and $g(t)$ are random waveforms that are not mutually correlated, then statistically \[\begin{equation} \int f(t)^2 dt \sim \int g(t)^2 dt \sim \int\left[f(t)-g(t)\right]^2 dt \label{eq.random} \end{equation}\] holds to within an order of magnitude, and thus $M\sim 1$ is expected. (When the cross term $\int f(t)g(t) dt$ vanishes exactly, $\int [f(t)-g(t)]^2 dt=\int f(t)^2 dt+\int g(t)^2 dt$ and Eq. (\ref{eq.misfit}) gives $M=\sqrt{2}$.)
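For sampled data, the integrals in the misfit definition become sums, and the common sampling interval cancels between numerator and denominator. The following is a minimal sketch of that discrete form, not the command's actual implementation; the function name `misfit` and the error raised for two all-zero series are assumptions made for illustration.

```python
import math


def misfit(f, g):
    """Discrete analogue of the misfit M for two equally sampled series."""
    if len(f) != len(g):
        raise ValueError("both series must have the same number of samples")
    # Numerator: sum of squared differences (the sampling interval cancels).
    num = sum((a - b) ** 2 for a, b in zip(f, g))
    # Denominator: half the summed energy of the two series.
    den = 0.5 * sum(a * a + b * b for a, b in zip(f, g))
    if den == 0.0:
        # Assumption: M is treated as undefined when both series are all zero.
        raise ValueError("misfit is undefined for two all-zero series")
    return math.sqrt(num / den)
```

With this sketch, identical series give 0, a nonzero series compared with an all-zero one gives $\sqrt{2}$, and a series compared with its negative gives 2, matching the special cases listed above.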

◆Source code

$YMAEDA_OPENTOOL_DIR/sequence/src/sequencefiles_group_by_misfit.c

◆Usage

Specify parameters by command-line arguments. The table below lists the parameters.

●Arguments not beginning with “-”

This command has no arguments that do not begin with “-”.

●Arguments beginning with a single “-”

This command has no arguments beginning with a single “-”.

●Arguments of the form “--Parameter name=Parameter value”

Arguments of the form “--Parameter name=Parameter value” can be given in any order. They can even be inserted between arguments not beginning with “-”. If conflicting options are specified, the later one takes precedence. Parameters that have default values can be omitted.

Parameter name	Meaning	Allowed parameter values	Default value
datadir	The path of the directory in which the time series data to be used are stored.	A string that represents a directory path (either absolute or relative).	.
timeSeriesData_listFile	The name of a text file that lists the time series data to be used. Write one time series data file name per line, as a path relative to the directory specified by parameter datadir. Empty lines and the part of each line after # are ignored as comments, so they can be inserted freely. Each time series data file must be in one of the time series data file formats of ymaeda_opentools (see special file formats). The definition range and sampling rate must be common to all time series data.	A string that represents a file name.	Cannot be omitted
threshold	A threshold value for the waveform misfit (Eq. \ref{eq.misfit}); if the misfit between two time series data is less than this value, the two are assigned to the same group.	A positive real number.	0.2
outputfile	The name of the output file to which the result of the grouping is written.	A string that represents a file name.	Cannot be omitted
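The list-file rules described for timeSeriesData_listFile (one file name per line, empty lines ignored, everything after # treated as a comment, names relative to datadir) can be sketched as follows. This is an illustrative reader only; the function name `read_list_file` is hypothetical and the command's own parser may differ in details such as whitespace handling.

```python
import os


def read_list_file(list_path, datadir="."):
    """Return the data file paths named in a list file, in file order."""
    paths = []
    with open(list_path, encoding="utf-8") as fh:
        for line in fh:
            # Drop the comment part (from '#' to end of line) and whitespace.
            name = line.split("#", 1)[0].strip()
            if name:  # skip empty and comment-only lines
                paths.append(os.path.join(datadir, name))
    return paths
```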

◆Behaviour

Read the time series data listed in the text file specified by parameter timeSeriesData_listFile, group them based on waveform misfits (Eq. \ref{eq.misfit}), and write the result to the file specified by parameter outputfile.

The output file consists of two columns: the 1st column is the time series data file name and the 2nd column is the group number (consecutive integers starting from 1). The columns are separated by a tab. The number of groups and the threshold value of the waveform misfit are written in comment lines at the top of the file.
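A downstream script could read this output format along the following lines. The sketch assumes the comment lines have the form `#key: value`, as in the sample output shown in the Example section; the function name `parse_groups` is hypothetical.

```python
def parse_groups(text):
    """Parse grouping output into (header dict, {file name: group number})."""
    header = {}
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Assumption: comment lines look like "#key: value".
            key, _, value = line[1:].partition(":")
            header[key.strip()] = value.strip()
            continue
        # Data lines: file name, a tab, then the group number.
        name, group = line.split("\t")
        groups[name] = int(group)
    return header, groups
```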

◆Example

sequencefiles_group_by_misfit --datadir=data --timeSeriesData_listFile=file_list.dat --outputfile=groups.dat

(file_list.dat)

a.seq2
b.seq2
c.seq2
d.seq2
e.seq2
f.seq2
g.seq2
h.seq2
i.seq2
j.seq2
k.seq2
l.seq2
m.seq2
n.seq2

In this example, the 14 time series data files a.seq2 through n.seq2 are read from the directory data, and an output file such as the following is written, where [TAB] represents a tab.

(groups.dat)

#Ngroups: 5
#threshold: 0.200000
data/a.seq2[TAB]2
data/b.seq2[TAB]1
data/c.seq2[TAB]3
data/d.seq2[TAB]1
data/e.seq2[TAB]2
data/f.seq2[TAB]1
data/g.seq2[TAB]1
data/h.seq2[TAB]1
data/i.seq2[TAB]2
data/j.seq2[TAB]3
data/k.seq2[TAB]1
data/l.seq2[TAB]4
data/m.seq2[TAB]5
data/n.seq2[TAB]1

◆The algorithm

Apart from using the misfit (Eq. \ref{eq.misfit}) as the measure of similarity, the grouping method is the same as that of sequencefiles_group_by_correlation. First, select as the master the time series data that has the largest number of pairs with misfits below the threshold value. All time series data whose misfit with this master is below the threshold are assigned to group 1. Apply the same procedure to the remaining time series data to define group 2. Groups 3, 4, 5, … are defined sequentially in the same way.
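The procedure can be sketched as a greedy loop over a precomputed matrix of pairwise misfits. This is an illustration under stated assumptions rather than the command's source: the function name `group_by_misfit` is hypothetical, the matrix is assumed symmetric, and ties between candidate masters are broken by taking the earliest series, which the manual does not specify.

```python
def group_by_misfit(misfits, threshold):
    """Greedy grouping from a symmetric matrix of pairwise misfits.

    Returns a list of group numbers (starting from 1), one per series.
    """
    n = len(misfits)
    labels = [0] * n
    remaining = list(range(n))
    group = 0
    while remaining:
        group += 1
        # For each remaining series, count the other remaining series
        # whose misfit with it is below the threshold.
        counts = {
            i: sum(1 for j in remaining if j != i and misfits[i][j] < threshold)
            for i in remaining
        }
        # Assumption: ties are broken by taking the earliest series.
        master = max(remaining, key=lambda i: counts[i])
        members = [j for j in remaining
                   if j == master or misfits[master][j] < threshold]
        for j in members:
            labels[j] = group
        remaining = [j for j in remaining if j not in members]
    return labels
```

Because membership is decided only by the misfit with the master, two members of the same group may have a mutual misfit above the threshold.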

◆Additional notes

As described above, correlation coefficients judge two monotonically increasing functions, one convex (an acceleration type) and the other concave (a deceleration type), to be “similar” despite their different shapes, and this program avoids the problem by using the misfit $M$ of Eq. (\ref{eq.misfit}) instead. This section examines how well $M$ captures such differences in shape.

Consider three functions $f_1(t)=t$, $f_2(t)=t^2$, and $f_3(t)=\sqrt{t}$ defined on the interval $t\in [0,1]$ (the same functions as in the documentation of sequencefiles_group_by_correlation), and evaluate their similarities using the misfit $M$. The misfit between $f_1(t)$ and $f_2(t)$ is
\[\begin{eqnarray} M_{12} &=& \sqrt{\frac{\int_0^1 [f_1(t)-f_2(t)]^2 dt} {(1/2)\int_0^1 [f_1(t)^2+f_2(t)^2] dt}} \nonumber \\ &=& \sqrt{\frac{2\int_0^1 (t-t^2)^2 dt} {\int_0^1 (t^2+t^4) dt}} \nonumber \\ &=& \sqrt{\frac{2\int_0^1 (t^2-2t^3+t^4) dt} {\int_0^1 (t^2+t^4) dt}} \nonumber \\ &=& \sqrt{\frac{2\left[\frac{1}{3}t^3-\frac{1}{2}t^4+\frac{1}{5}t^5\right]_0^1} {\left[\frac{1}{3}t^3+\frac{1}{5}t^5\right]_0^1}} \nonumber \\ &=& \sqrt{\frac{2\left(\frac{1}{3}-\frac{1}{2}+\frac{1}{5}\right)} {\frac{1}{3}+\frac{1}{5}}} \nonumber \\ &=& \sqrt{\frac{2\cdot \frac{10-15+6}{30}} {\frac{5+3}{15}}} \nonumber \\ &=& \sqrt{\frac{\frac{1}{15}} {\frac{8}{15}}} \nonumber \\ &=& \sqrt{\frac{1}{8}} \nonumber \\ &=& \frac{1}{2\sqrt{2}} \nonumber \\ &\sim& 0.3536 \label{eq.M12} \end{eqnarray}\]
the misfit between $f_1(t)$ and $f_3(t)$ is
\[\begin{eqnarray} M_{13} &=& \sqrt{\frac{\int_0^1 [f_1(t)-f_3(t)]^2 dt} {(1/2)\int_0^1 [f_1(t)^2+f_3(t)^2] dt}} \nonumber \\ &=& \sqrt{\frac{2\int_0^1 (t-\sqrt{t})^2 dt} {\int_0^1 (t^2+t) dt}} \nonumber \\ &=& \sqrt{\frac{2\int_0^1 (t^2-2t^{3/2}+t) dt} {\int_0^1 (t^2+t) dt}} \nonumber \\ &=& \sqrt{\frac{2\left[\frac{1}{3}t^3-\frac{4}{5}t^{5/2}+\frac{1}{2}t^2 \right]_0^1} {\left[\frac{1}{3}t^3+\frac{1}{2}t^2\right]_0^1}} \nonumber \\ &=& \sqrt{\frac{2\left(\frac{1}{3}-\frac{4}{5}+\frac{1}{2}\right)} {\frac{1}{3}+\frac{1}{2}}} \nonumber \\ &=& \sqrt{\frac{2\cdot \frac{10-24+15}{30}} {\frac{2+3}{6}}} \nonumber \\ &=& \sqrt{\frac{\frac{1}{15}} {\frac{5}{6}}} \nonumber \\ &=& \sqrt{\frac{2}{25}} \nonumber \\ &=& \frac{\sqrt{2}}{5} \nonumber \\ &\sim& 0.2828 \label{eq.M13} \end{eqnarray}\]
and the misfit between $f_2(t)$ and $f_3(t)$ is
\[\begin{eqnarray} M_{23} &=& \sqrt{\frac{\int_0^1 [f_2(t)-f_3(t)]^2 dt} {(1/2)\int_0^1 [f_2(t)^2+f_3(t)^2] dt}} \nonumber \\ &=& \sqrt{\frac{2\int_0^1 (t^2-\sqrt{t})^2 dt} {\int_0^1 (t^4+t) dt}} \nonumber \\ &=& \sqrt{\frac{2\int_0^1 (t^4-2t^{5/2}+t) dt} {\int_0^1 (t^4+t) dt}} \nonumber \\ &=& \sqrt{\frac{2\left[\frac{1}{5}t^5-\frac{4}{7}t^{7/2}+\frac{1}{2}t^2 \right]_0^1} {\left[\frac{1}{5}t^5+\frac{1}{2}t^2\right]_0^1}} \nonumber \\ &=& \sqrt{\frac{2\left(\frac{1}{5}-\frac{4}{7}+\frac{1}{2}\right)} {\frac{1}{5}+\frac{1}{2}}} \nonumber \\ &=& \sqrt{\frac{2\cdot \frac{14-40+35}{70}} {\frac{2+5}{10}}} \nonumber \\ &=& \sqrt{\frac{\frac{9}{35}} {\frac{7}{10}}} \nonumber \\ &=& \sqrt{\frac{18}{49}} \nonumber \\ &=& \frac{3\sqrt{2}}{7} \nonumber \\ &\sim& 0.6061 \label{eq.M23} \end{eqnarray}\]
Table 1 compares these results with the correlation coefficients evaluated in the documentation of sequencefiles_group_by_correlation. The table indicates that $M$ is more sensitive than the correlation coefficient to the differences among $f_1(t)$, $f_2(t)$, and $f_3(t)$.
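The closed-form values $1/(2\sqrt{2})$, $\sqrt{2}/5$, and $3\sqrt{2}/7$ can be cross-checked by numerical integration. The sketch below uses a simple midpoint rule on $[0,1]$; the helper name `misfit_on_unit_interval` and the number of subintervals are arbitrary choices made for illustration.

```python
import math


def misfit_on_unit_interval(f, g, n=20000):
    """Approximate the misfit M of f and g on [0, 1] with the midpoint rule."""
    h = 1.0 / n
    num = 0.0
    den = 0.0
    for k in range(n):
        t = (k + 0.5) * h  # midpoint of the k-th subinterval
        a, b = f(t), g(t)
        num += (a - b) ** 2
        den += 0.5 * (a * a + b * b)
    # The step h multiplies both sums and cancels in the ratio.
    return math.sqrt(num / den)


def f1(t):
    return t


def f2(t):
    return t * t


def f3(t):
    return math.sqrt(t)


m12 = misfit_on_unit_interval(f1, f2)  # compare with 1/(2*sqrt(2))
m13 = misfit_on_unit_interval(f1, f3)  # compare with sqrt(2)/5
m23 = misfit_on_unit_interval(f2, f3)  # compare with 3*sqrt(2)/7
```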

Table 1. Comparison of misfits $M$ and correlation coefficients.

$f(t)$, $g(t)$	$M$	$1-M$	Correlation coefficient
$f(t)=g(t)$	$0$	$1$	$1$
Random and not correlated	$\sim 1$	$\sim 0$	$0$
$f(t)=-g(t)$	$2$	$-1$	$-1$
$f(t)=t$, $g(t)=t^2$ ($t\in [0,1]$)	$\frac{1}{2\sqrt{2}}\sim 0.3536$	$0.6464$	$\frac{\sqrt{15}}{4}\sim 0.9682$
$f(t)=t$, $g(t)=\sqrt{t}$ ($t\in [0,1]$)	$\frac{\sqrt{2}}{5}\sim 0.2828$	$0.7172$	$\frac{2\sqrt{6}}{5}\sim 0.9798$
$f(t)=t^2$, $g(t)=\sqrt{t}$ ($t\in [0,1]$)	$\frac{3\sqrt{2}}{7}\sim 0.6061$	$0.3939$	$\frac{2\sqrt{10}}{7}\sim 0.9035$

