The library business data service is helping to deliver some “Introduction to Stata” training sessions.
Goto http://www.pbres.co.uk/ and select Research for the slides and training datasets. (Good example of joining data from separate sources.)
This seemed a good opportuntity to update the answer:
If you find additional Stata resources you would recommend, please leave a reply.
This scatterplot shows the relationship between the percentage total return in the following year (RetP1) and the optimism of the chairman’s statement in the annual report (Optim). In general it appears that increased optimism only corresponds with increased variability in returns.
This has only been done for a small sample of 29 FTSE 100 companies over the years 2006-2011. The main purpose was to provide an example of analysis using textual, rather than numeric, information in company annual reports.
- Identify the FTSE 100 companies at 31 December 2011. [Historical FTSE 100 Index constituents (July 2012)]
- Obtain the annual reports for a sample of these companies. [Where can I find current and historical company annual reports?]
- Convert the annual reports from PDF to Text.
- Extract the chairman’s statement from each report. (This could only be done manually one report at a time)
- Analyse the statements using the text analysis software Diction.
- Link the results from Diction with the total return (obtained from Datastream) for the same and following year [Total shareholder return (July 2011)].
From a library perspective the main interest is the research process rather than the results, and specifically the data collection needed to assemble a dataset for analysis (steps 1-4 above). Researchers have to work hard to transform the raw data available to them into a dataset ready for analysis.
Manchester academic research in a similar area -
Further exploring our small sample dataset
Another scatterplot showing the relationship between the percentage total return in the current year and the optimism of the chairman’s statement shows a similar lack of correlation.
Looking at both the certainty and the optimism that is measured in the chairmen’s statements, this small sample shows that a certainty below average and an optimism below average give a total return in the following year, and in the current year, that higher.
This is only a small sample, and we have no theory why selecting companies because their chairman’s statement is uncertain and pessimistic should be better than random selection.
There are times when the database query doesn’t match the data that you are trying to obtain. For example, you have a list of X companies and for each company the date of an event E (e.g. a merger) but the database query has to be in terms of a list of X companies and a date range Ya-Yb.
WRDS is a typical example – you have a list of 100 gvkey – year pairs where you want some accounting data from Compustat NA (North America). The company identifier gvkey is used to identify companies in Compustat so you can use it in WRDS Compustat queries.
Option 1 – Multiple WRDS requests
Sort your list by the year and create a company list for each year, say N lists. You then make N WRDS-Compustat requests using the relevant company list and year. Finally you combine the N results.
This option is good if you have to make a few requests (i.e N is small)
Option 2 – One WRDS request and select the results you want
Calculate the range of years that you need. You then make one WRDS-Compustat request covering all these years.
If you have 1990-2010 you could now have a result set with 2100 observations (21 years times 100 companies) and you need to select the 100 relevant ones.
You can do this selection using the Merge command in a Statistics package (the following uses Stata as an example)
- Decide on the variable names in your WRDS results that are going to identify the observations of interest (for example gvkey and fyear)
- If necessary, Convert the WRDS results into a Stata dataset and save this.
- Create a Stata dataset of your 100 gvkey-year pairs using the same variable names as in your WRDS results dataset (e.g. gvkey and fyear).
(If necessary you may need to note and eliminate duplicates from this dataset – Stata commands duplicates list and drop duplicates)
- Merge your gvkey-year dataset with your WRDS results:
merge gvkey fyear using “P:\Projects\BMAN70162\acqwrds.dta”, unique sort
(Note ”P:\Projects\BMAN70162\acqwrds.dta” is the file saved at step 2)
- You should have one match for each observation in your gvkey-year dataset. This is indicated by the new Stata variable _merge = 3.
Observations with _merge = 1 indicate there was no matching entry in the WRDS result dataset – this might require further investigation.
- Observations with _merge = 2 are all the extra results you want to eliminate.
This technique does not depend on Stata – other Statistics packages provide similar merge functions but the detail of the commands will vary. (Indeed, it is probably possible to do this in Excel.)