Cross-Entropy

Cross-entropy measures the average number of bits needed to encode data drawn from one distribution (the true distribution) using a code optimized for another distribution (the model distribution). It is used frequently in machine learning to quantify the difference between the true distribution of the data and the predictions made by a model. The cross-entropy between two distributions $P$ and $Q$ is defined as:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

where $P$ is the true distribution and $Q$ is the model distribution. If we were sending messages, $P$ represents the true distribution of messages and $Q$ represents the distribution we use to encode them; cross-entropy then gives the average number of bits required to encode messages from the true distribution using the code optimized for the estimated distribution.
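
As a concrete illustration, here is a minimal sketch in Python (assuming NumPy; the `cross_entropy` helper and the example distributions are illustrative, not from any particular library):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(P, Q) in bits between two discrete distributions.

    p: true distribution, q: model distribution (both sum to 1).
    eps guards against log(0) when the model assigns zero probability.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q + eps))

# Example: the model Q is slightly off from the true distribution P.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(cross_entropy(p, p))  # ~1.5 bits: the entropy of P itself (optimal code)
print(cross_entropy(p, q))  # > 1.5 bits, since Q's code is not optimal for P
```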

KL Divergence

KL divergence measures the inefficiency of assuming that the distribution is $Q$ when the true distribution is $P$. It is a measure of how one probability distribution diverges from a second, expected probability distribution. KL divergence can be seen as the number of extra bits required to encode samples from $P$ using a code optimized for $Q$ rather than the optimal code. The formula for KL divergence is:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

KL divergence can also be interpreted in terms of information gain: the number of bits saved when the true distribution $P$ is used for coding instead of the estimated distribution $Q$. However, it is not strictly the number of bits saved by knowing $P$ in the sense of switching coding schemes, but rather the difference in the expected number of bits required to represent the data under the two distributions, i.e. $D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$.
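
A short self-contained sketch (again assuming NumPy, with illustrative distributions) that checks the identity $D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$ numerically:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits: the extra bits per symbol paid for coding
    samples from P with a code optimized for Q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log2((p + eps) / (q + eps)))

p = np.array([0.5, 0.25, 0.25])  # true distribution
q = np.array([0.4, 0.4, 0.2])    # model distribution

entropy_p = -np.sum(p * np.log2(p))         # H(P): optimal average code length
cross_entropy_pq = -np.sum(p * np.log2(q))  # H(P, Q): cost of using Q's code

print(kl_divergence(p, q))           # ~0.072 bits of overhead per symbol
print(cross_entropy_pq - entropy_p)  # same value: D_KL = H(P, Q) - H(P)
```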

So, while your descriptions are conceptually on track, it’s important to be precise about what these terms mean and how they are used, especially in contexts like machine learning and information theory.