Muddled measurements on clarity – Bank Underground

Jul 18, 2025

Charlie Warburton and James Brookes

Economists have repeatedly shown that readability of central banking communication matters. But they typically measure readability in a crude way – using the simplistic but influential Flesch-Kincaid metric. The Flesch-Kincaid Grade Level is based on word and sentence length and is commonly interpreted as the number of years of education required to understand a text. However, recent advances in computational linguistics toolkits empower us to consider finer-grained markers of language comprehension missed by Flesch-Kincaid. Here, we revisit Jansen (2011), which found that Fed Chair testimonies with lower Flesch-Kincaid Grade Level scores – indicating higher readability – were associated with lower market volatility. Our results show that, compared to more sophisticated linguistic metrics, Flesch-Kincaid is a relatively poor indicator of readability.

What Flesch-Kincaid misses: introducing our novel linguistic metrics

Drawing on earlier work investigating press pick-up of Bank of England communications and asset price movement, we develop a series of psycholinguistic metrics for text readability. These are intended to draw out text features directly linked to different aspects of language comprehension.

We develop four novel psycholinguistic text readability metrics:

Word Prevalence: words that are more commonly known are processed faster and more easily than words that are not.

Local Personal Pronoun Rate: we measure the rate of first (I, me, we, us, our, etc) and second person (you, your, yours, etc) pronouns in a document. Such usage establishes speaker-interlocutor rapport, and information that is flagged as being personally relevant is stored better and retrieved more accurately.

Contextual Expectancy Score: Contextual expectancy – the likelihood of a word in context – matters because whilst reading, the reader is predicting the upcoming word. In other words, upcoming words are already being accessed from the mental lexicon ahead of their being read. When a word is read that is not expected, the reader needs to retrieve that unexpected word, causing a processing difficulty.

Mean Dependency Arc Length: Although two sentences may contain the same number of words, and the same words, one may be easier to process than the other because related words are kept closer together. For example:

The distance (in words) between a word and its dependent is called its arc length. In (1), the arc length is 1, in (2) it is 6 – this makes (1) easier to process.
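As a rough illustration, two of the metrics above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the pronoun inventory is an assumed expansion of the examples given (the post lists them with "etc"), the rate is assumed to be per word token, and the dependency heads are annotated by hand rather than produced by a parser. The names `local_pronoun_rate`, `mean_arc_length` and `FOX_HEADS` are hypothetical.

```python
import re

# Assumed pronoun lists: the post gives examples followed by "etc",
# so the exact inventory here is illustrative only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
LOCAL_PRONOUNS = FIRST_PERSON | SECOND_PERSON


def local_pronoun_rate(text: str) -> float:
    """Share of word tokens that are first- or second-person pronouns.

    Assumption: the rate is taken per word token.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in LOCAL_PRONOUNS for token in tokens) / len(tokens)


def mean_arc_length(heads: list[int]) -> float:
    """Mean distance, in words, between each token and its head.

    heads[i] is the 1-based position of the head of token i + 1.
    A head of 0 marks the root, which has no arc and is skipped.
    """
    arcs = [abs(position - head)
            for position, head in enumerate(heads, start=1)
            if head != 0]
    return sum(arcs) / len(arcs) if arcs else 0.0


# Hand-annotated heads for "The quick brown fox jumps over the lazy dog",
# assuming a parse in which "over" attaches to "jumps" and "dog" to "over".
FOX_HEADS = [4, 4, 4, 5, 0, 5, 9, 9, 6]

print(local_pronoun_rate("The quick brown fox jumps over the lazy dog"))  # 0.0
print(local_pronoun_rate("We will keep you informed"))                    # 0.4
print(mean_arc_length(FOX_HEADS))                                         # 1.75
```

Under this assumed parse, the fox sentence has eight arcs with a total length of 14, giving a mean of 1.75. A different dependency scheme, for example one that attaches "dog" directly to "jumps", would give a different figure.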

To exemplify the power of these metrics, let’s compare the well-known pangram ‘The quick brown fox jumps over the lazy dog’ with another, totally incomprehensible pangram: ‘Cwm fjord-bank glyphs vext quiz’.

Metric	Cwm fjord-bank glyphs vext quiz	The quick brown fox jumps over the lazy dog	Heuristic
Flesch-Kincaid Grade Level	0.5	2.3	Lower is better
Average Word Prevalence	1.64	2.39	Higher is better
Local Personal Pronoun Rate	0	0	Higher is better
Contextual Expectancy Score	0.078	0.18	Higher is better
Mean Dependency Arc Length	1.8	1.75	Lower is better

As expected, our psycholinguistic metrics show that ‘The quick brown fox…’ is easier to understand. However, the Flesch-Kincaid Grade Level suggests the reverse is true and the meaningless ‘Cwm fjord-bank…’ is easier to understand! Furthermore, the Grade Level for ‘Cwm fjord-bank…’ is 0.5. If we were to follow the interpretation that the score reflects the number of years of education required to understand the text, the text should be understandable to a primary school student in their first year.
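To see why the Grade Level rewards the nonsense pangram, it helps to look at the standard formula, which uses only words per sentence and syllables per word. The sketch below is illustrative rather than the authors' code: the syllable counter is a naive vowel-run heuristic (it will miscount many English words, for example those with a silent final "e"), and treating a hyphenated compound as a single word is an assumption. The function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Naive estimate: count runs of vowels (y included), minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Assumption: hyphenated compounds such as "fjord-bank" count as one word.
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    if not words:
        raise ValueError("text contains no words")
    # Count runs of terminal punctuation; unpunctuated text is one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(word) for word in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words)
            - 15.59)


fox = "The quick brown fox jumps over the lazy dog"
cwm = "Cwm fjord-bank glyphs vext quiz"
print(round(flesch_kincaid_grade(fox), 1))  # 2.3
print(round(flesch_kincaid_grade(cwm), 1))  # 0.5
```

With these assumptions the sketch gives the same rounded values as the table above: the nonsense pangram scores lower simply because it has fewer words in its one sentence, and nothing in the formula asks whether the words are known to anyone.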

This example demonstrates the danger of relying on overly simple metrics such as the Flesch-Kincaid Grade Level. We now revisit an earlier study which used the Grade Level, and add in the linguistic features above.

Empirical application: testing the relationship between readability and market volatility

Jansen (2011) investigated the semi-annual ‘Humphrey-Hawkins’ testimonies given by the Chair of the Federal Reserve to Congress to test the relationship between communication clarity and market volatility. The author found that testimonies with lower Grade Level scores (that is, greater clarity) were associated with lower subsequent volatility in medium-term interest rates.

To assess the relative effectiveness of the Flesch-Kincaid Grade Level as an indicator of communication clarity, we calculate the psycholinguistic metrics discussed above for the testimonies and test their predictive power for market volatility alongside the Flesch-Kincaid Grade Level. In line with the original study, we focus on medium-term interest rate volatility, specifically the three-year treasury market. (Similar results are obtained when analysing the two- and five-year markets.)

The original study relied solely on a least-squares regression approach to assess the relationship between readability and market volatility, whereas we employ two different models to assess the relative performance of Flesch-Kincaid against our novel metrics. We use a random forest model to study the relative association of the text readability metrics with subsequent market volatility in a non-parametric, non-linear setting. We then use a ridge regression model to examine the association in a parametric linear setting, which also allows for statistical testing.

We first assess the relative importance of the text readability metrics for volatility in the three-year treasury yield by using a random forest model.

A random forest is a collection of decision trees whose predictions are averaged. We use a variant called conditional inference forests, which are collections of conditional inference trees. Each tree aims to predict volatility in the three-year treasury yield based on the textual features. We refer the reader to another Bank Underground blog post describing the details of how random forests work.

We grew 500 trees this way and then calculated the variable importance statistics based on the model. Variable importance is measured by evaluating the increase in error of the random forest model when each variable is removed. A high increase in error signals importance, whilst a low increase in error signals unimportance. For reasons of stability, we ran 100 iterations and averaged the variable importance statistics to produce our results.
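The procedure just described can be sketched as follows. This is a hedged illustration, not the authors' setup: it swaps in scikit-learn's standard `RandomForestRegressor` for a conditional inference forest, uses permutation importance (shuffling a variable rather than dropping it) as the measure of increase in error, and runs on synthetic data with hypothetical feature names. The tree and iteration counts are reduced from the post's 500 and 100 so that the sketch runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the testimonies): the outcome depends on the
# first feature only, and the second feature is pure noise.
n_docs = 200
X = rng.normal(size=(n_docs, 2))
y = -1.0 * X[:, 0] + rng.normal(scale=0.3, size=n_docs)
feature_names = ["informative_metric", "noise_metric"]  # hypothetical names


def averaged_importance(X, y, n_iterations=5, n_trees=100):
    """Average permutation importance over several forest fits.

    Importance is the increase in mean squared error when a feature's
    values are shuffled, evaluated here on the training data.
    """
    runs = []
    for seed in range(n_iterations):
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        forest.fit(X, y)
        result = permutation_importance(
            forest, X, y,
            scoring="neg_mean_squared_error",
            n_repeats=5,
            random_state=seed,
        )
        runs.append(result.importances_mean)
    return np.mean(runs, axis=0)


importance = averaged_importance(X, y)
for name, value in zip(feature_names, importance):
    print(f"{name}: {value:.3f}")
```

Averaging over several fits smooths out the randomness in tree construction, which is the stability concern the text mentions. In this toy setting the informative feature should show a much larger increase in error than the noise feature.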

The Flesch-Kincaid Grade Level has the lowest importance of all the text readability metrics considered. When it was removed from the model, the average increase in error was only around 0.5%. In contrast, the model’s error rate increased by over 7% on average when word prevalence was removed. These results signal that when other psycholinguistic metrics are included, the Flesch-Kincaid Grade Level is not an important determinant of the random forest’s results. This finding is robust to using alternative treasury maturities as the dependent variable and including controls for macroeconomic conditions, time effects, and the Federal Reserve Chair.

We now examine the relative performance of the text readability metrics in a parametric model. This is closer to the approach used in Jansen (2011), although we employ a ridge regression model to control for correlation between the covariates.

We transformed the text readability metrics into standardised scores. This means each coefficient can be interpreted as the association between a one standard deviation increase in that metric and subsequent volatility in the three-year treasury yield, which makes the coefficients comparable across metrics.

Using 5,000 bootstrapped samples, we applied a ridge regression model to produce a distribution of coefficients. Bootstrapping helps to assess the stability and reliability of the ridge regression estimates across different subsamples of the data.
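A minimal sketch of this bootstrap, using only NumPy, is shown below. It is an illustration under assumptions rather than the authors' code: the data are synthetic, the penalty value of 1.0 is arbitrary (the post does not state how the penalty was chosen), and the function names are hypothetical. Ridge is fitted with its closed-form solution.

```python
import numpy as np


def standardise(X):
    """Convert each column to z-scores (mean 0, standard deviation 1)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def ridge_coefficients(X, y, penalty):
    """Closed-form ridge estimate: (X'X + penalty * I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


def bootstrap_ridge(X, y, penalty=1.0, n_samples=5000, seed=0):
    """Refit ridge on rows resampled with replacement.

    Returns an array with one row of coefficients per bootstrap sample.
    """
    rng = np.random.default_rng(seed)
    n_obs = len(y)
    Z = standardise(X)
    coefs = np.empty((n_samples, X.shape[1]))
    for b in range(n_samples):
        idx = rng.integers(0, n_obs, size=n_obs)
        # Centre within each resample so no intercept term is needed.
        Zb = Z[idx] - Z[idx].mean(axis=0)
        yb = y[idx] - y[idx].mean()
        coefs[b] = ridge_coefficients(Zb, yb, penalty)
    return coefs


# Synthetic stand-in data (not the testimonies): only the first of three
# features is related to the outcome, with a negative sign.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
y = -1.0 * X[:, 0] + data_rng.normal(scale=0.5, size=200)

coefs = bootstrap_ridge(X, y)
lower, median, upper = np.percentile(coefs, [2.5, 50, 97.5], axis=0)
for j in range(X.shape[1]):
    print(f"feature {j}: median {median[j]:.2f}, "
          f"95% interval [{lower[j]:.2f}, {upper[j]:.2f}]")
```

The percentile interval of each coefficient's bootstrap distribution plays the role of the confidence interval: if it comfortably excludes zero, the association is stable across resamples, and if it straddles zero, the sign of the association cannot be pinned down.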

The boxplot displays the lower quartile, median, upper quartile and 95% confidence intervals of the coefficient distributions. The median value of the Flesch-Kincaid Grade Level’s coefficient is slightly positive – indicating a higher grade level is associated with slightly higher volatility. However, this effect is not significant at the 10% level. In fact, the entire lower quartile of the distribution is below zero. Therefore, we cannot conclude that grade level has any association with volatility once our other text readability metrics are considered. This finding was robust to the choice of other medium-term yield maturities.

What should we make of word prevalence and dependency arc length? Word prevalence is fairly simple to explain: the more people who know the words in a text on average, ie the more accessible and understandable the words are, the more readable the text becomes, and we see that this is associated with lower market volatility. For dependency arc length, the further apart related words are in the document, the more structurally complex the text should be to read, and thus we might expect market volatility to increase. However, the opposite happens. We think this is because complex dependency structure can indicate chained subordination (clauses nested inside each other), which is used to add supporting, clarificatory information in overt and coherent ways and thereby reduces uncertainty around the messaging. Future research might test the presence of subordination as an additional variable.

Rethinking readability: implications for clearer communication

We find that, in relative terms, the Flesch-Kincaid Grade Level holds less predictive power for market volatility once other measures of text readability are considered. This suggests it is a weaker measure of readability more broadly, and challenges the traditional reliance on Flesch-Kincaid.

This is not just academic pedantry; the Flesch-Kincaid Grade Level is also widely used to measure the readability of documents in, eg, government and education. The more sophisticated psycholinguistic metrics we have tested the Flesch-Kincaid Grade Level against can be straightforwardly implemented, either by using one’s own code, as we have done, or by using packages such as LingFeat. By adopting improved readability metrics, central bankers can better diagnose textual complexity and craft communications that the public more readily understands. This reduces the risk of costly misinterpretation.

In our study, we find that word prevalence – a metric tracking word frequency and familiarity – has the strongest association with communication clarity and lower subsequent market volatility. This finding aligns with the insights from a recent Bank of England Staff Working Paper, which emphasises the importance of the conceptual complexity of words – their meaning – over grammatical and structural elements for communication clarity.

It is finally worth noting that our results apply within an English-language dominant perspective. This affects the extent to which the findings could apply to central bank communications more broadly. Further analysis in this area is therefore warranted.

Charlie Warburton is an MPhil student at the University of Cambridge and James Brookes works in the Bank’s Advanced Analytics Division. This post was written while Charlie Warburton was working in the Bank’s Governance, Accounting, Resilience and Data Division.

If you want to get in touch, please email us at [email protected] or leave a comment below.

Comments will only appear once approved by a moderator, and are only published where a full name is supplied. Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.