Suppose you are given a sequence of $n$ data points $x_i$ for $i = 1, 2, \ldots, n$ which are corrupted by noise. Suppose also that you do not have any knowledge regarding what kind of shape the original data had before noise was added to it, i.e., that you do not know which type of function (e.g. sine, linear, etc.) would model the original data well. This post describes a technique which can be used to denoise data in this type of situation. An implementation of the algorithm presented below can be found here.

Our goal here is to determine $n$ new data points $\tilde{x}_i$ which achieve the two goals below:

1. | each value $\tilde{x}_i$ should not be too far from the original value $x_i$ it approximates |

2. | the sequence of points $\tilde{x}_i$ should be smooth |

One way to satisfy the first goal is by not allowing the Euclidean distance between the two sequences to be too large. In other words, we would like to keep $$ \|{\bf \tilde{x}} - {\bf x}\|^2_2 = \sum_{i=1}^n(\tilde{x}_i - x_i)^2 \label{post_df0bdbd936cfac191141770bf91a6b6e_term1} $$ small, where ${\bf x}$ and ${\bf\tilde{x}}$ are vectors which hold the sequence of values $x_i$ and $\tilde{x}_i$ respectively for $i = 1, 2, \ldots, n$. However, if we force this term to be too small, we will end up forcing each $\tilde{x}_i$ to be too close to the value $x_i$ it approximates, but this means we are doing nothing but generating the original noisy sequence. As an extreme case, we can take ${\bf \tilde{x}} = {\bf x}$, in which case $\|{\bf \tilde{x}} - {\bf x}\|_2$ will be identically zero but nothing useful will have been achieved.

Now let us focus on the second goal. Noisy data is often characterized by large differences between adjacent values, i.e., $x_i$ and $x_{i+1}$ can be very different. When the data is smooth, $x_i \approx x_{i+1}$, so ideally our generated denoised sequence $\tilde{x}_i$ will be such that $$ \sum_{i=1}^{n-1}(\tilde{x}_{i+1} - \tilde{x}_i)^2 \label{post_df0bdbd936cfac191141770bf91a6b6e_term2} $$ is also not too large. Since the values $x_i$ are usually the results of measurements which are conducted at regular time intervals (e.g. every $\Delta{t}$ seconds), we can interpret the minimization of the term above as an attempt to make the approximate derivative values of the sequence $\tilde{x}_i$ be small: $$ \sum_{i=1}^{n-1}(\tilde{x}_{i+1} - \tilde{x}_i)^2 = (\Delta{t})^2\sum_{i=1}^{n-1} \left(\frac{\tilde{x}_{i+1} - \tilde{x}_i}{\Delta{t}}\right)^2 $$ Notice that a perfect minimization of this sum is achieved when all $\tilde{x}_i$ are equal, i.e., $\tilde{x}_1 = \tilde{x}_2 = \ldots = \tilde{x}_n = c$, but a constant sequence is of little value to us unless the original sequence itself is approximately constant.

Collecting the facts above, we see that our desired goals are in direct conflict with each other, so our best approach to the problem is to compromise on both sides and provide a solution which achieves each goal only partially but well enough for our purposes. One good strategy is to minimize the sum of both terms in \eqref{post_df0bdbd936cfac191141770bf91a6b6e_term1} and \eqref{post_df0bdbd936cfac191141770bf91a6b6e_term2} simultaneously, i.e., minimize the following cost function: $$ J_\mu({\bf x}, {\bf\tilde{x}}) = \sum_{i=1}^n(\tilde{x}_i - x_i)^2 + \mu\sum_{i=1}^{n-1}(\tilde{x}_{i+1} - \tilde{x}_i)^2 \label{post_df0bdbd936cfac191141770bf91a6b6e_cost_func} $$ The factor $\mu$ allows us to tune how much we want to denoise the original data. If $\mu = 0$, we will only minimize the first term and generate the original sequence $\tilde{x}_i = x_i$. As $\mu$ becomes larger, the second term becomes more important and, in the limit $\mu \rightarrow \infty$, minimizing $J_\mu({\bf x}, {\bf\tilde{x}})$ becomes equivalent to minimizing this second term; as we have seen above, the generated sequence becomes then a constant sequence $\tilde{x}_i = c$ for $i = 1,2,\ldots,n$.

Since we now have reduced our problem to finding the vector ${\bf\tilde{x}}$ which minimizes the value of $J_\mu({\bf x}, {\bf\tilde{x}})$ ($\mu$ is a constant here), we must now find a way to solve this minimization problem. Luckily, as we will see, this is nothing more than a linear least squares problem in disguise. Indeed, if we define: $$ {\bf b} = \left( \begin{matrix} {\bf x} \\ {\bf 0}_{n-1} \end{matrix} \right) \label{post_df0bdbd936cfac191141770bf91a6b6e_b} $$ where ${\bf 0}_{n-1}$ is the $(n-1)$-dimensional zero vector and also $$ D_{n-1} = \left(\begin{matrix} -1 & 1 & 0 & 0 & \ldots & 0 \\ 0 & -1 & 1 & 0 & \ldots & 0 \\ 0 & 0 & -1 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -1 & 1 \end{matrix}\right) \label{post_df0bdbd936cfac191141770bf91a6b6e_D} $$ where $D_{n-1}$ is an $(n-1) \times n$ matrix which can be interpreted as a "differentiation matrix", then we can rewrite equation \eqref{post_df0bdbd936cfac191141770bf91a6b6e_cost_func} as below: $$ J_\mu({\bf x}, {\bf\tilde{x}}) = \|A{\bf\tilde{x}} - {\bf b}\|^2_2 $$ where: $$ A = \left(\begin{matrix} I_{n\times n} \\ \sqrt{\mu}D_{n-1}\end{matrix}\right) \label{post_df0bdbd936cfac191141770bf91a6b6e_A} $$ is a $(2n-1) \times n$ matrix, with $I_{n \times n}$ being the $n$-dimensional identity matrix. Indeed, from equations \eqref{post_df0bdbd936cfac191141770bf91a6b6e_b}, \eqref{post_df0bdbd936cfac191141770bf91a6b6e_D} and \eqref{post_df0bdbd936cfac191141770bf91a6b6e_A}, we have that: $$ A{\bf\tilde{x}} - {\bf b} = \left(\begin{matrix} I_{n\times n} \bf\tilde{x} - {\bf x}\\ \sqrt{\mu}D_{n-1}\bf\tilde{x}\end{matrix}\right) = \left(\begin{matrix} \bf\tilde{x} - {\bf x}\\ \sqrt{\mu}D_{n-1}\bf\tilde{x}\end{matrix}\right) \label{post_df0bdbd936cfac191141770bf91a6b6e_Ax_tilde_b} $$ and the square of the Euclidean norm of the vector in equation \eqref{post_df0bdbd936cfac191141770bf91a6b6e_Ax_tilde_b} is equal to $J_\mu({\bf x}, {\bf\tilde{x}})$. At this stage, all we need to do is use any of the readily-available linear least squares solving algorithms to determine ${\bf\tilde{x}}$ and we are done.

Although somewhat outside the scope of this post, it is worth mentioning that finding the vector ${\bf\tilde{x}}$ which minimizes $J_\mu({\bf x}, {\bf\tilde{x}})$ is equivalent to solving the following linear system (these are called the normal equations for this linear least squares problem): $$ (A^T A){\bf\tilde{x}} = A^T{\bf b} \label{post_df0bdbd936cfac191141770bf91a6b6e_linsys} $$ This linear system has a unique solution since $A^T A$ is invertible. Indeed, from the definition of $A$ in equation \eqref{post_df0bdbd936cfac191141770bf91a6b6e_A}, we must have $A{\bf y} \neq {\bf 0}$ unless ${\bf y} = {\bf 0}$, so if $A^T A{\bf y} = {\bf 0}$, we must as well have: $$ {\bf y}^TA^T A{\bf y} = (A{\bf y})^T(A{\bf y}) = \|A{\bf y}\|^2_2 = {\bf 0} \Longrightarrow A{\bf y} = {\bf 0} $$ and therefore ${\bf y} = {\bf 0}$, so $A^T A{\bf y} = {\bf 0}$ if and only if ${\bf y} = {\bf 0}$ (i.e., $A^TA$ is invertible).

Now we can summarize our method for producing a denoised version ${\bf\tilde x}$ of a noisy sequence ${\bf x}$:

1. | compute ${\bf b}$ and $A$ as given by equations \eqref{post_df0bdbd936cfac191141770bf91a6b6e_b} and \eqref{post_df0bdbd936cfac191141770bf91a6b6e_A} respectively |

2. | solve the linear system given on equation \eqref{post_df0bdbd936cfac191141770bf91a6b6e_linsys}; its solution ${\bf\tilde{x}}$ is the denoised version of the original sequence ${\bf x}$ |

If the final solution is either too smooth or not smooth enough, try using a different values of $\mu$: increasing $\mu$ will make the generated sequence ${\bf\tilde{x}}$ smoother, while smaller values of $\mu$ will enforce less smoothing, i.e., more confidence is put on the precision of the original data ${\bf x}$.

Figure 1 shows a typical result for the algorithm described above. Here, Gaussian noise is added to data representing a sine wave to simulate a noisy data sequence. The algorithm we developed is then used to produce a denoised version of the noisy sequence.

Fig. 1: | Gaussian noise with mean zero and standard deviation $\sigma = 0.1$ is added to a pure sine wave. The algorithm described above is then used to (partially) filter out this added noise component. |

### Extra notes on the differentiation term

As mentioned above, $D_{n-1}{\bf x}$ represents a derivative which is computed at each point $x_i$ for $i = 2, 3, \ldots, n$. This is a first order accurate backward difference approximation to the derivatives computed. We can also use different formulas to approximate the derivatives. For instance, if we use the second order accurate central difference formula, we will get the following cost function: $$ J^c_\mu({\bf x}, {\bf\tilde{x}}) = \sum_{i=1}^n(\tilde{x}_i - x_i)^2 + \mu\sum_{i=2}^{n-1}(\tilde{x}_{i+1} - \tilde{x}_{i-1})^2 \label{post_df0bdbd936cfac191141770bf91a6b6e_cost_func2} $$ The differentiation matrix associated with $J^c_\mu({\bf x}, {\bf\tilde{x}})$ is now an $(n-2)\times n$ matrix: $$ D^c_{n-2} = \left(\begin{matrix} -1 & 0 & 1 & 0 & \ldots & \ldots & 0 \\ 0 & -1 & 0 & 1 & \ldots & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -1 & 0 & 1 \end{matrix}\right) \label{post_df0bdbd936cfac191141770bf91a6b6e_Dc} $$ Figure 2 shows the results obtained for the same data sequence from figure 1 and the same $\mu$ value.

Fig. 2: | Results obtained when the cost function given in equation \eqref{post_df0bdbd936cfac191141770bf91a6b6e_cost_func2} is used. |

The results look rather odd. Where is this oscillation on the denoised sequence coming from? Well, consider first the original matrix $D_{n-1}$. If $D_{n-1}{\bf z} = {\bf 0}$ for some $n$-dimensional vector ${\bf z}$, then by the definition of $D_{n-1}$ we must have $z_2 - z_1 = 0$, $\ldots$, $z_n - z_{n-1} = 0$, i.e., $z_1 = z_2 = \ldots = z_n$, so the only vector in the null space of $D_{n-1}$ is the constant vector (as should be the case for a good derivative operator). But $D^c_{n-2}$ suffers from a bad disease: $D^c_{n-2}{\bf z} = {\bf 0}$ implies only that $z_1 = z_3$, $z_2 = z_4$, $z_3 = z_5$, $z_4 = z_6$ and so on, meaning as long as all the odd and even components of ${\bf z}$ are set to two distinct constants $c_{\textrm{odd}}$ and $c_{\textrm{even}}$ respectively, the resulting vector will be in the null space of $D^c_{n-2}$. But this means a highly oscillatory vector such as ${\bf z} = (1,-1,1,-1,\ldots)$ is in the null space of $D^c_{n-2}$ and therefore such a vector produces no error on the second term of equation \eqref{post_df0bdbd936cfac191141770bf91a6b6e_cost_func2}. This explains why the resulting "denoised sequence" is plagued by a clearly visible oscillatory noise which is actually noticeable even for large values of $\mu$.

Instead of trying to generate a smooth sequence by limiting the values which the first derivative of the denoised sequence can have, we can instead try limiting the values of its second derivative. Using the second order accurate central difference approximation to the second derivative, we get the following cost function: $$ J^{c,2}_\mu({\bf x}, {\bf\tilde{x}}) = \sum_{i=1}^n(\tilde{x}_i - x_i)^2 + \mu\sum_{i=2}^{n-1}(\tilde{x}_{i+1} - 2\tilde{x}_i + \tilde{x}_{i-1})^2 \label{post_df0bdbd936cfac191141770bf91a6b6e_cost_func3} $$ The resulting differentiation matrix has dimensions $(n-2)\times n$ and has the same form as the discrete Laplacian operator in one dimension: $$ L_{n-2} = \left(\begin{matrix} 1 & -2 & 1 & 0 & \ldots & \ldots & 0 \\ 0 & 1 & -2 & 1 & \ldots & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 & -2 & 1 \end{matrix}\right) \label{post_df0bdbd936cfac191141770bf91a6b6e_Lc} $$ Figure 3 shows the results obtained for the same data sequence from figure 1 and the same $\mu$ value. They are clearly better than the results show in figure 2 because the null space of $L_{n-2}$ does not contain highly oscillatory vectors (but this fact will not be proved here). Notice that since in this case we are not trying to limit variations in the denoised data (i.e., the distance between $\tilde{x}_i$ and $\tilde{x}_{i+1}$), but variations on the first derivative values instead, the resulting sequence ${\bf\tilde{x}}$ tends to be more linear than the one from figure 1.

Fig. 3: | Results obtained when the cost function given in equation \eqref{post_df0bdbd936cfac191141770bf91a6b6e_cost_func3} is used. |

## Comments

No comments posted yet.