Could you imagine a future where computers, rather than governments and central bankers, made economic decisions? Given all of the economic mishaps we’ve seen over the past decade, one could argue it isn’t a particularly bad idea!
Natural language processing could allow us to make more sense of the economy than we do currently. As it stands, investors and policymakers use index benchmarks and quantitative measures such as GDP growth to gauge economic health.
That said, one potential application of NLP is to analyse text data (such as major economic policy documents) and then “learn” from such texts in order to generate appropriate economic policies independently of human intervention.
In this example, an LSTM model is trained using text from a sample ECB policy document, in order to generate “new” text data, with a view to revealing insights from such text that could be used for policy purposes.
Specifically, a temperature hyperparameter is configured to control the randomness of the generated text predictions. The relevant text is vectorized into sequences of characters, and a single-layer LSTM model is used for next-character sampling. A text generation loop then produces a block of text for each temperature setting — the higher the temperature, the more randomness is induced in each block of text.
Convert PDF to text
As was done previously, the PDF file can be converted to plain text with pdf2txt as follows:
pdf2txt.py -o eb201903.txt eb201903.en.pdf
The text file is imported, the distribution function is defined (which is used to set the temperature parameter), and the relevant text is formatted to remove unnecessary punctuation. The full code is available at the relevant GitHub repository.
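As an illustrative sketch of the cleanup step (the exact cleaning rules live in the linked repository and may differ; this is a minimal assumed version):

```python
import re

def clean_text(raw):
    # Keep letters, digits and spaces; replace all other characters,
    # then collapse runs of whitespace into single spaces.
    # (The original repository's exact rules may differ.)
    cleaned = re.sub(r'[^A-Za-z0-9 ]+', ' ', raw)
    return re.sub(r'\s+', ' ', cleaned).strip()

# Usage with the text file produced by pdf2txt:
# with open('eb201903.txt', encoding='utf-8') as f:
#     text = clean_text(f.read())
```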
Here is a sample of the generated text:
“Economic Bulletin Issue Contents Update on economic and monetary developments Summary External environment Financial developments Economic activity Prices and costs Money and credit Boxes What the maturing tech cycle signals for the global economy Emerging market currencies the role of global risk the US dollar and domestic forces Exploring the factors behind the widening in euro area corporate bond spreads The predictive power of real M for real economic activity in the euro area Articles The economic implications of rising protectionism a euro area and global perspective Fiscal rules in the euro area and lessons from other monetary unions...”
Character Sequence Vectorization
The next task is to form vectorized sequences of characters. This transforms the text data into a format that the eventual LSTM model can understand: since recurrent neural networks work with sequential data, the text needs to be formatted as such in order to allow the model to ultimately generate new text from these sequences.
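This step can be sketched as follows, one-hot encoding overlapping character windows (the `maxlen` and `step` values here are illustrative assumptions following Chollet’s well-known character-level example, not necessarily the repository’s exact settings):

```python
import numpy as np

def vectorize(text, maxlen=60, step=3):
    # Extract overlapping sequences of maxlen characters, sampling a
    # new sequence every `step` characters, along with the character
    # that follows each sequence (the prediction target).
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i: i + maxlen])
        next_chars.append(text[i + maxlen])
    chars = sorted(set(text))
    char_indices = {c: i for i, c in enumerate(chars)}
    # One-hot encode: x has shape (samples, maxlen, n_chars),
    # y has shape (samples, n_chars)
    x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(sentences), len(chars)), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            x[i, t, char_indices[char]] = 1
        y[i, char_indices[next_chars[i]]] = 1
    return x, y, chars, char_indices
```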
Text Generation with LSTM (Long Short-Term Memory Network)
The LSTM model is defined, and categorical_crossentropy is used as the loss function:
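A minimal single-layer model along these lines, assuming Keras (the 128 units and RMSprop optimizer follow Chollet’s example rather than necessarily the original repository’s exact configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(maxlen, n_chars, lstm_units=128):
    # Single LSTM layer over one-hot character windows, followed by a
    # softmax over the character vocabulary for next-character prediction.
    model = keras.Sequential([
        keras.Input(shape=(maxlen, n_chars)),
        layers.LSTM(lstm_units),
        layers.Dense(n_chars, activation='softmax'),
    ])
    model.compile(
        loss='categorical_crossentropy',
        optimizer=keras.optimizers.RMSprop(learning_rate=0.01),
    )
    return model
```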
As we can see, the LSTM model is run, and the sampling function is then defined in order to modify the randomness of the predictions based on the temperature setting, i.e. the higher the temperature, the more inherent randomness in the generated text block.
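A sampling function of this kind typically looks like the following (modeled on Chollet’s example; the small epsilon guarding against log(0) is my addition):

```python
import numpy as np

def sample(preds, temperature=1.0):
    # Reweight the model's softmax output by temperature and draw one
    # character index. Higher temperatures flatten the distribution
    # (more surprising characters); lower temperatures sharpen it
    # (more conservative, repetitive text).
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-10) / temperature  # epsilon avoids log(0)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return int(np.argmax(probas))
```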
Now, the model is trained over 100 epochs, and text blocks are generated.
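The generation loop can be sketched as follows. The helper names here (`generate`, `sample_with_temperature`) are illustrative rather than the repository’s actual API, and the loop assumes the one-hot character vectorization described above:

```python
import random
import numpy as np

def sample_with_temperature(preds, temperature):
    # Reweight the softmax output by temperature, then draw one index
    preds = np.log(np.asarray(preds, dtype='float64') + 1e-10) / temperature
    preds = np.exp(preds) / np.sum(np.exp(preds))
    return int(np.argmax(np.random.multinomial(1, preds, 1)))

def generate(model, text, char_indices, chars, maxlen=60,
             temperatures=(0.3, 0.6, 0.9, 1.2), length=400):
    # Pick a random seed window from the corpus, then generate `length`
    # new characters for each temperature setting.
    start = random.randint(0, len(text) - maxlen - 1)
    seed = text[start: start + maxlen]
    results = {}
    for temperature in temperatures:
        generated = seed
        sentence = seed
        for _ in range(length):
            # One-hot encode the current maxlen-character window
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.0
            preds = model.predict(x_pred, verbose=0)[0]
            next_char = chars[sample_with_temperature(preds, temperature)]
            generated += next_char
            # Slide the window forward by one character
            sentence = sentence[1:] + next_char
        results[temperature] = generated
    return results
```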
Let’s have a look at the text generated by the LSTM.
** Epoch 1
Train on 48559 samples
48559/48559 [==============================] - 92s 2ms/sample - loss: 2.2975
** Seed used for text generation: "udgeting Vol No See Kamps C and Leiner Killinger N Taking stock of the"
** Temperature reading: 0.3
udgeting Vol No See Kamps C and Leiner Killinger N Taking stock of the aspont dement and to the ass are tich on the the the fiscal cond cond allo ade and trodes in the dished to the aler and discal risk the ass a diss of the contricion on the and the the inder the promacting the and contricions and the ade as of cond debt on the govern and the ass and contrally mont the asso cond the alount indectic indectic cond stark the the fiscal indectic indecting in the wisher
** Temperature reading: 0.6
ic indectic cond stark the the fiscal indectic indecting in the wisher scased stark this mighter and burtory prowth conted rules inglowtromand to eurode the support fiscal probuctions a disled the the alonates on the ECB ECB Nally reted ECB ECromanditions prodactions Nomer a dowres with on the Dechors mishic and staty the Nover in the Fighes and the controrations are start risutic s as on the inder in the prodic incroms and by decade in the contrinitions and fincent
** Temperature reading: 0.9
r in the prodic incroms and by decade in the contrinitions and fincentic as Thises mast of deliges A fom the Sed Sents This stoved Diders hat reare add diverves Fishs thcter concreusione uy the Thar istary and line economic bate anduar on tessic to the as and grouchous agso exseating Stuter The USion of the intery incrisirations exports prownted inders fimmoditions of the by indermond in the Compasts Jince Tleasions etro contional devetic Unitionitithians P incecry
** Temperature reading: 1.2
sts Jince Tleasions etro contional devetic Unitionitithians P incecry Buernoten sCs and incred onIsirs Ficoneringesepriverces TVngh ELesrodity mLianwrs pDptig omsljimations diviuat Bules Leekmrevisgeryet prowsitherneaind this a quarccy fionaby axlycaltrofsembed Ksjowss DHPomedisiond fishmel fibMlo nody or hix Fhas Acfecc Exvecaad lhameslewayss monkedimeg expos a Eroysinglit ons efperm tive nuatern fisuty rith S OEECh Culnipmpicity ox s Sol f Gantaly onW Chfngated r
From the above, we can see that the text generated in the first epoch is highly incoherent. Let’s see what happens by the final epoch:
** Epoch 99
Train on 48559 samples
48559/48559 [==============================] - 97s 2ms/sample - loss: 0.5080
** Seed used for text generation: "US yields observed over the same period Notable outliers are Turkey an"
** Temperature reading: 0.3
US yields observed over the same period Notable outliers are Turkey and the developments in the euro area and global premiod instituta and the euro area and global policy in the supinite the euro area and global premiod insterly data for matuain regarding fundam unit data are and domestic production such for the euro area and global premiod institutions for the euro area and global premiod institution of the euro area and global premiod insterly be in the suping sta
** Temperature reading: 0.6
tion of the euro area and global premiod insterly be in the suping stantarial markets that have highly difined by and therefore to the statistics by While the first quarter of indicators need ensurte of for the euro area and services such bur is stite consentsing sussts only of c effective pooding for exports from in the euro area and global perspective Index for the euro area and browed and domestic production data loan market indicators s wideral such as a place a
** Temperature reading: 0.9
tic production data loan market indicators s wideral such as a place a model but ardic points of fructs are underlying ficang and exports only be in the US states from different capluction surrencling to dail often in indicators s widerpysing by an good Foremaing in Chart Enon Soutces Economic States reiming hole of of nonariscilation of the euro area countries diffureence to regress increased to funds to preds related to MFI spiness in len and en rate more signific
** Temperature reading: 1.2
funds to preds related to MFI spiness in len and en rate more significant in seasshs Governing data derivahis detaris data for The latiling US risks The global trade Bulres Exports Exportuchation surrenclaine and combsirally of China Chinese recessions US states are for real M and and are to the percentained term underly to an avord increasing the monetary unions steanally aC wheer the countries that are added in This rule braqus visiically ot exports Haurding and v
While there is still some incoherence in the text, we can see that the model’s output is now considerably more legible than what was first generated.
When using the seed “US yields observed over the same period Notable outliers are Turkey an”, we can see that the model is generating text that could help to explain fluctuations in U.S. yields and also notable outliers. For instance, terms such as “domestic production data”, “loan market indicators”, and “Chinese recessions” are factors that the model has indicated as being related to the seed text, and could therefore be cited as significant factors behind yield fluctuations.
In this regard, NLP is unlikely to replace human interpretation of data (at least in the near future). However, it is likely to prove a highly useful tool in complementing human understanding of specific issues.
We still need humans to ultimately create and interpret text (NLP is not as adept as a human is at tasks such as script or song writing), but advances in text generation can greatly complement human understanding of text, and specifically inferring ideas from that text.
In this example, you have seen how an LSTM model can be used to generate new text from existing sequences and ultimately create or infer new ideas from such text.
Many thanks for reading, and you can access the relevant GitHub repository and original post here. I also highly recommend the “Deep Learning with Python” textbook by Francois Chollet for gaining a further understanding of this topic.