A Teaser Example

Let’s start with a very reduced example that highlights some of the key capabilities of physics-based learning approaches. Let’s assume our physical model is a very simple equation: a parabola along the positive x-axis.

Despite being very simple, for every point along there are two solutions, i.e. we have two modes, one above the other one below the x-axis, as shown on the left below. If we don’t take care a conventional learning approach will give us an approximation like the red one shown in the middle, which is completely off. With an improved learning setup, ideally, by using a discretized numerical solver, we can at least accurately represent one of the modes of the solution (shown in green on the right).

_images/intro-teaser-side-by-side.png

Fig. 2 Side by side - supervised versus differentiable physics training.

Differentiable physics

One of the key concepts of the following chapters is what we’ll call differentiable physics (DP). This means that we use domain knowledge in the form of model equations, and then integrate discretized versions of these models into the training process. As implied by the name, having differentiable formulations is crucial for this process to support the training of neural networks.

Let’s illustrate the properties of deep learning via DP with the following example: We’d like to find an unknown function \(f^*\) that generates solutions from a space \(Y\), taking inputs from \(X\), i.e. \(f^*: X \to Y\). In the following, we’ll often denote idealized, and unknown functions with a \(*\) superscript, in contrast to their discretized, realizable counterparts without this superscript.

Let’s additionally assume we have a generic differential equation \(\mathcal P^*: Y \to Z\) (our model equation), that encodes a property of the solutions, e.g. some real world behavior we’d like to match. Later on, \(P^*\) will often represent time evolutions, but it could also be a constraint for conservation of mass (then \(\mathcal P^*\) would measure divergence). But to keep things as simple as possible here, the model we’ll look at in the following is a mapping back to the input space \(X\), i.e. \(\mathcal P^*: Y \to X\).

Using a neural network \(f\) to learn the unknown and ideal function \(f^*\), we could turn to classic supervised training to obtain \(f\) by collecting data. This classical setup requires a dataset by sampling \(x\) from \(X\) and adding the corresponding solutions \(y\) from \(Y\). We could obtain these, e.g., by classical numerical techniques. Then we train the NN \(f\) in the usual way using this dataset.

In contrast to this supervised approach, employing a differentiable physics approach takes advantage of the fact that we can often use a discretized version of the physical model \(\mathcal P\) and employ it to guide the training of \(f\). I.e., we want \(f\) to be aware of our simulator \(\mathcal P\), and to interact with it. This can vastly improve the learning, as we’ll illustrate below with a very simple example (more complex ones will follow later on).

Note that in order for the DP approach to work, \(\mathcal P\) has to be differentiable, as implied by the name. These differentials, in the form of a gradient, are what’s driving the learning process.

Finding the inverse function of a parabola

To illustrate the difference of supervised and DP approaches, we consider the following simplified setting: Given the function \(\mathcal P: y\to y^2\) for \(y\) in the interval \([0,1]\), find the unknown function \(f\) such that \(\mathcal P(f(x)) = x\) for all \(x\) in \([0,1]\). Note: to make things a bit more interesting, we’re using \(y^2\) here for \(\mathcal P\) instead of the more common \(x^2\) parabola, and the discretization is simply given by representing the \(x\) and \(y\) via floating point numbers in the computer for this simple case.

We know that possible solutions for \(f\) are the positive or negative square root function (for completeness: piecewise combinations would also be possible). Knowing that this is not overly difficult, a solution that suggests itself is to train a neural network to approximate this inverse mapping \(f\). Doing this in the “classical” supervised manner, i.e. purely based on data, is an obvious starting point. After all, this approach was shown to be a powerful tool for a variety of other applications, e.g., in computer vision.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

For supervised training, we can employ our solver \(\mathcal P\) for the problem to pre-compute the solutions we need for training: We randomly choose between the positive and the negative square root. This resembles the general case, where we would gather all data available to us (e.g., using optimization techniques to compute the solutions). Such data collection typically does not favor one particular mode from multimodal solutions.

# X-Data
N = 200
X = np.random.random(N)
# Generation Y-Data
sign = (- np.ones((N,)))**np.random.randint(2,size=N)
Y = np.sqrt(X) * sign

Now we can define a network, the loss, and the training configuration. We’ll use a simple keras architecture with three hidden layers, ReLU activations.

# Neural network
act = tf.keras.layers.ReLU()
nn_sv = tf.keras.models.Sequential([
  tf.keras.layers.Dense(10, activation=act),
  tf.keras.layers.Dense(10, activation=act),
  tf.keras.layers.Dense(1,activation='linear')])

And we can start training via a simple mean squared error loss, using fit function from keras:

# Loss function
loss_sv = tf.keras.losses.MeanSquaredError()
optimizer_sv = tf.keras.optimizers.Adam(lr=0.001)
nn_sv.compile(optimizer=optimizer_sv, loss=loss_sv)

# Training
results_sv = nn_sv.fit(X, Y, epochs=5, batch_size= 5, verbose=1)
Epoch 1/5
40/40 [==============================] - 0s 1ms/step - loss: 0.5084
Epoch 2/5
40/40 [==============================] - 0s 1ms/step - loss: 0.5022
Epoch 3/5
40/40 [==============================] - 0s 1ms/step - loss: 0.5011
Epoch 4/5
40/40 [==============================] - 0s 1ms/step - loss: 0.5002
Epoch 5/5
40/40 [==============================] - 0s 1ms/step - loss: 0.5007

As both NN and the data set are very small, the training converges very quickly. However, if we inspect the predictions of the network, we can see that it is nowhere near the solution we were hoping to find: it averages between the data points on both sides of the x-axis and therefore fails to find satisfying solutions to the problem above.

The following plot nicely highlights this: it shows the data in light gray, and the supervised solution in red.

# Results
plt.plot(X,Y,'.',label='Data points', color="lightgray")
plt.plot(X,nn_sv.predict(X),'.',label='Supervised', color="red")
plt.xlabel('y')
plt.ylabel('x')
plt.title('Standard approach')
plt.legend()
plt.show()
_images/intro-teaser_15_0.png

😱 This is obviously completely wrong! The red solution is nowhere near one of the two modes of our solution shown in gray.

Note that the red line is often not perfectly at zero, which is where the two modes of the solution should average out in the continuous setting. This is caused by the relatively coarse sampling with only 200 points in this example.


A differentiable physics approach

Now let’s apply a differentiable physics approach to find \(f\): we’ll directly include our discretized model \(\mathcal P\) in the training.

There is no real data generation step; we only need to sample from the \([0,1]\) interval. We’ll simply keep the same \(x\) locations used in the previous case, and a new instance of a NN with the same architecture as before nn_dp:

# X-Data
# X = X , we can directly re-use the X from above, nothing has changed...
# Y is evaluated on the fly

# Model
nn_dp = tf.keras.models.Sequential([
  tf.keras.layers.Dense(10, activation=act),
  tf.keras.layers.Dense(10, activation=act),
  tf.keras.layers.Dense(1, activation='linear')])

The loss function is the crucial point for training: we directly incorporate the function f into the loss. In this simple case, the loss_dp function simply computes the square of the prediction y_pred.

Later on, a lot more could happen here: we could evaluate finite-difference stencils on the predicted solution, or compute a whole implicit time-integration step of a solver. Here we have a simple mean-squared error term of the form \(|y_{\text{pred}}^2 - y_{\text{true}}|^2\), which we are minimizing during training. It’s not necessary to make it so simple: the more knowledge and numerical methods we can incorporate, the better we can guide the training process.

#Loss
mse = tf.keras.losses.MeanSquaredError()
def loss_dp(y_true, y_pred):
    return mse(y_true,y_pred**2)

optimizer_dp = tf.keras.optimizers.Adam(lr=0.001)
nn_dp.compile(optimizer=optimizer_dp, loss=loss_dp)
#Training
results_dp = nn_dp.fit(X, X, epochs=5, batch_size=5, verbose=1)
Epoch 1/5
40/40 [==============================] - 0s 656us/step - loss: 0.2814
Epoch 2/5
40/40 [==============================] - 0s 1ms/step - loss: 0.1259
Epoch 3/5
40/40 [==============================] - 0s 962us/step - loss: 0.0038
Epoch 4/5
40/40 [==============================] - 0s 949us/step - loss: 0.0014
Epoch 5/5
40/40 [==============================] - 0s 645us/step - loss: 0.0012

Now the network actually has learned a good inverse of the parabola function! The following plot shows the solution in green.

# Results
plt.plot(X,Y,'.',label='Datapoints', color="lightgray")
#plt.plot(X,nn_sv.predict(X),'.',label='Supervised', color="red") # optional for comparison
plt.plot(X,nn_dp.predict(X),'.',label='Diff. Phys.', color="green") 
plt.xlabel('x')
plt.ylabel('y')
plt.title('Differentiable physics approach')
plt.legend()
plt.show()
_images/intro-teaser_24_0.png

This looks much better 😎, at least in the range of 0.1 to 1.

What has happened here?

  • We’ve prevented an undesired averaging of multiple modes in the solution by evaluating our discrete model w.r.t. current prediction of the network, rather than using a pre-computed solution. This lets us find the best mode near the network prediction, and prevents an averaging of the modes that exist in the solution manifold.

  • We’re still only getting one side of the curve! This is to be expected because we’re representing the solutions with a deterministic function. Hence, we can only represent a single mode. Interestingly, whether it’s the top or bottom mode is determined by the random initialization of the weights in \(f\) - run the example a couple of times to see this effect in action. To capture multiple modes we’d need to extend the NN to capture the full distribution of the outputs and parametrize it with additional dimensions.

  • The region with \(x\) near zero is typically still off in this example. The network essentially learns a linear approximation of one half of the parabola here. This is partially caused by the weak neural network: it is very small and shallow. In addition, the evenly spread of sample points along the x-axis bias the NN towards the larger y values. These contribute more to the loss, and hence the network invests most of its resources to reduce the error in this region.

Discussion

It’s a very simple example, but it very clearly shows a failure case for supervised learning. While it might seem very artificial at first sight, many practical PDEs exhibit a variety of these modes, and it’s often not clear where (and how many) exist in the solution manifold we’re interested in. Using supervised learning is very dangerous in such cases. We might unknowingly get an average of these different modes.

Good and obvious examples are bifurcations in fluid flow. Smoke rising above a candle will start out straight, and then, due to tiny perturbations in its motion, start oscillating in a random direction. The images below illustrate this case via numerical perturbations: the perfectly symmetric setup will start turning left or right, depending on how the approximation errors build up. Averaging the two modes would give an unphysical, straight flow similar to the parabola example above.

Similarly, we have different modes in many numerical solutions, and typically it’s important to recover them, rather than averaging them out. Hence, we’ll show how to leverage training via differentiable physics in the following chapters for more practical and complex cases.

_images/intro-fluid-bifurcation.jpg

Fig. 3 A bifurcation in a buoyancy-driven fluid flow: the “smoke” shown in green color starts rising in a perfectly straight manner, but tiny numerical inaccuracies grow over time to lead to an instability with vortices alternating to one side (top-right), or in the opposite direction (bottom-right).

Next steps

For each of the following notebooks, there’s a “next steps” section like the one below which contains recommendations about where to start modifying the code. After all, the whole point of these notebooks is to have readily executable programs as a basis for own experiments. The data set and NN sizes of the examples are often quite small to reduce the runtime of the notebooks, but they’re nonetheless good starting points for potentially complex and large projects.

For the simple DP example above:

  • This notebook is intentionally using a very simple setup. Change the training setup and NN above to obtain a higher-quality solution such as the green one shown in the very first image at the top.

  • Or try extending the setup to a 2D case, i.e. a paraboloid. Given the function \(\mathcal P:(y_1,y_2)\to y_1^2+y_2^2\), find an inverse function \(f\) such that \(\mathcal P(f(x)) = x\) for all \(x\) in \([0,1]\).

  • If you want to experiment without installing anything, you can also [run this notebook in colab].