Post #2266 by tedthedog on the Mechanical Investing board

No. of Recommendations: 5

Sorry, I'm late to the "ML for MI" thread
https://www.shrewdm.com/MB?pid=410527735
But I'm quite interested, so I thought I'd take the liberty of posting a few questions separately, in case they might not be seen tacked onto the old thread. Kudos to the OP!

For ML I've been putting in 60 to 100 financial and momentum features
Am curious about the features you used, do they include "fundamental ratios" e.g. P/B, P/S, etc often used in screens?

Screens seem similar to decision trees. Have you tried something like Breiman's "Classification and Regression Trees" or random forests on fundamental data to create screens? This isn't "deep learning" a la neural networks, but it's a powerful ML algorithm.
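To illustrate the screen/tree analogy: a one-level decision tree (a "stump") picks a single threshold on one feature, which is exactly the shape of a screen criterion. The following is a minimal pure-Python sketch, not CART itself; the P/B values and outcome labels are made-up toy data, and `gini` and `best_split` are hypothetical helper names.

```python
# Hypothetical toy data: (P/B ratio, 1 if the stock later beat the market else 0).
samples = [(0.6, 1), (0.8, 1), (1.1, 0), (1.4, 1),
           (2.0, 0), (2.5, 0), (3.2, 0), (4.0, 0)]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(data):
    """Return (threshold, impurity) for the split with the lowest weighted Gini."""
    data = sorted(data)
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(data)):
        # Candidate threshold: midpoint between two neighboring feature values.
        threshold = (data[i - 1][0] + data[i][0]) / 2
        left = [y for x, y in data if x < threshold]
        right = [y for x, y in data if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

threshold, score = best_split(samples)
print(f"Screen rule: keep stocks with P/B < {threshold:.2f} (impurity {score:.4f})")
```

A full tree repeats this search inside each resulting group and across all features, and a random forest averages many such trees fit on resampled data.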

Deep learning neural nets don't do well on numerical data. But I think they might handle categorizations such as "very low, low, medium, fairly high, high", which may be all the precision that's needed (see below).
Perhaps one could train a deep neural network on fundamental data and with the right prompts have them create a screen?
It could be cool and possibly useful to have "an AI" (I hate that terminology!) to suggest screens (Jim, your job is safe, I'm sure it'd never replace you!)

Below is a ChatGPT chat where I test how well it understands a categorization of numeric data, e.g. converting values to "very low, low, medium, fairly high, high". It seemed to "get it", but you never know when these things will start hallucinating:

Me:
If you had to pick the largest value from sets of values categorized as "very low, low, medium, fairly high, high", which one of these five categories would you pick?
ChatGPT:
To select the largest value among the categories "very low, low, medium, fairly high, high," the category "high" would be chosen.

Me:
How about a sort of middle value?
ChatGPT:
To select a middle value among the categories "very low, low, medium, fairly high, high," the category "medium" would be chosen.

Me:
What category would you assign to the ratio of values in "High" to values in "Low"?
ChatGPT:
The ratio of values in the "High" category to values in the "Low" category would likely be assigned to the "very high" category. This is because dividing a high value by a low value generally results in a significantly large ratio.

No. of Recommendations: 6

For ML I've been putting in 60 to 100 financial and momentum features
Am curious about the features you used, do they include "fundamental ratios" e.g. P/B, P/S, etc often used in screens?

There are about 100 standard zoo factors widely used. For example, "Empirical Asset Pricing via Machine Learning" uses:
"a large collection of stock-level predictive characteristics based on the cross-section of stock returns literature. These include 94 characteristics (61 of which are updated annually, 13 are updated quarterly, and 20 are updated monthly). In addition, we include 74 industry dummies corresponding to the first two digits of Standard Industrial Classification (SIC) codes. Table A.6 in the Internet Appendix provides the details of these characteristics. We also construct eight macroeconomic predictors"

No. | Acronym | Firm characteristic | Paper's author(s) | Year, Journal | Data source | Frequency
1 | absacc | Absolute accruals | Bandyopadhyay, Huang & Wirjanto | 2010, WP | Compustat | Annual
2 | acc | Working capital accruals | Sloan | 1996, TAR | Compustat | Annual
3 | aeavol | Abnormal earnings announcement volume | Lerman, Livnat & Mendenhall | 2007, WP | Compustat+CRSP | Quarterly
4 | age | # years since first Compustat coverage | Jiang, Lee & Zhang | 2005, RAS | Compustat | Annual
5 | agr | Asset growth | Cooper, Gulen & Schill | 2008, JF | Compustat | Annual
6 | baspread | Bid-ask spread | Amihud & Mendelson | 1989, JF | CRSP | Monthly
7 | beta | Beta | Fama & MacBeth | 1973, JPE | CRSP | Monthly
8 | betasq | Beta squared | Fama & MacBeth | 1973, JPE | CRSP | Monthly
9 | bm | Book-to-market | Rosenberg, Reid & Lanstein | 1985, JPM | Compustat+CRSP | Annual
10 | bm_ia | Industry-adjusted book to market | Asness, Porter & Stevens | 2000, WP | Compustat+CRSP | Annual
11 | cash | Cash holdings | Palazzo | 2012, JFE | Compustat | Quarterly
12 | cashdebt | Cash flow to debt | Ou & Penman | 1989, JAE | Compustat | Annual
13 | cashpr | Cash productivity | Chandrashekar & Rao | 2009, WP | Compustat | Annual
14 | cfp | Cash flow to price ratio | Desai, Rajgopal & Venkatachalam | 2004, TAR | Compustat | Annual
15 | cfp_ia | Industry-adjusted cash flow to price ratio | Asness, Porter & Stevens | 2000, WP | Compustat | Annual
16 | chatoia | Industry-adjusted change in asset turnover | Soliman | 2008, TAR | Compustat | Annual
17 | chcsho | Change in shares outstanding | Pontiff & Woodgate | 2008, JF | Compustat | Annual
18 | chempia | Industry-adjusted change in employees | Asness, Porter & Stevens | 1994, WP | Compustat | Annual
19 | chinv | Change in inventory | Thomas & Zhang | 2002, RAS | Compustat | Annual
20 | chmom | Change in 6-month momentum | Gettleman & Marks | 2006, WP | CRSP | Monthly
21 | chpmia | Industry-adjusted change in profit margin | Soliman | 2008, TAR | Compustat | Annual
22 | chtx | Change in tax expense | Thomas & Zhang | 2011, JAR | Compustat | Quarterly
23 | cinvest | Corporate investment | Titman, Wei & Xie | 2004, JFQA | Compustat | Quarterly
24 | convind | Convertible debt indicator | Valta | 2016, JFQA | Compustat | Annual
25 | currat | Current ratio | Ou & Penman | 1989, JAE | Compustat | Annual
26 | depr | Depreciation / PP&E | Holthausen & Larcker | 1992, JAE | Compustat | Annual
27 | divi | Dividend initiation | Michaely, Thaler & Womack | 1995, JF | Compustat | Annual
28 | divo | Dividend omission | Michaely, Thaler & Womack | 1995, JF | Compustat | Annual
29 | dolvol | Dollar trading volume | Chordia, Subrahmanyam & Anshuman | 2001, JFE | CRSP | Monthly
30 | dy | Dividend to price | Litzenberger & Ramaswamy | 1982, JF | Compustat | Annual
31 | ear | Earnings announcement return | Kishore, Brandt, Santa-Clara & Venkatachalam | 2008, WP | Compustat+CRSP | Quarterly
32 | egr | Growth in common shareholder equity | Richardson, Sloan, Soliman & Tuna | 2005, JAE | Compustat | Annual
33 | ep | Earnings to price | Basu | 1977, JF | Compustat | Annual
34 | gma | Gross profitability | Novy-Marx | 2013, JFE | Compustat | Annual
35 | grCAPX | Growth in capital expenditures | Anderson & Garcia-Feijoo | 2006, JF | Compustat | Annual
36 | grltnoa | Growth in long term net operating assets | Fairfield, Whisenant & Yohn | 2003, TAR | Compustat | Annual
37 | herf | Industry sales concentration | Hou & Robinson | 2006, JF | Compustat | Annual
38 | hire | Employee growth rate | Bazdresch, Belo & Lin | 2014, JPE | Compustat | Annual
39 | idiovol | Idiosyncratic return volatility | Ali, Hwang & Trombley | 2003, JFE | CRSP | Monthly
40 | ill | Illiquidity | Amihud | 2002, JFM | CRSP | Monthly
41 | indmom | Industry momentum | Moskowitz & Grinblatt | 1999, JF | CRSP | Monthly
42 | invest | Capital expenditures and inventory | Chen & Zhang | 2010, JF | Compustat | Annual
43 | lev | Leverage | Bhandari | 1988, JF | Compustat | Annual
44 | lgr | Growth in long-term debt | Richardson, Sloan, Soliman & Tuna | 2005, JAE | Compustat | Annual
45 | maxret | Maximum daily return | Bali, Cakici & Whitelaw | 2011, JFE | CRSP | Monthly
46 | mom12m | 12-month momentum | Jegadeesh | 1990, JF | CRSP | Monthly
47 | mom1m | 1-month momentum | Jegadeesh & Titman | 1993, JF | CRSP | Monthly
48 | mom36m | 36-month momentum | Jegadeesh & Titman | 1993, JF | CRSP | Monthly
49 | mom6m | 6-month momentum | Jegadeesh & Titman | 1993, JF | CRSP | Monthly
50 | ms | Financial statement score | Mohanram | 2005, RAS | Compustat | Quarterly
51 | mvel1 | Size | Banz | 1981, JFE | CRSP | Monthly
52 | mve_ia | Industry-adjusted size | Asness, Porter & Stevens | 2000, WP | Compustat | Annual
53 | nincr | Number of earnings increases | Barth, Elliott & Finn | 1999, JAR | Compustat | Quarterly
54 | operprof | Operating profitability | Fama & French | 2015, JFE | Compustat | Annual
55 | orgcap | Organizational capital | Eisfeldt & Papanikolaou | 2013, JF | Compustat | Annual
56 | pchcapx_ia | Industry adjusted % change in capital expenditures | Abarbanell & Bushee | 1998, TAR | Compustat | Annual
57 | pchcurrat | % change in current ratio | Ou & Penman | 1989, JAE | Compustat | Annual
58 | pchdepr | % change in depreciation | Holthausen & Larcker | 1992, JAE | Compustat | Annual
59 | pchgm_pchsale | % change in gross margin - % change in sales | Abarbanell & Bushee | 1998, TAR | Compustat | Annual
60 | pchquick | % change in quick ratio | Ou & Penman | 1989, JAE | Compustat | Annual
61 | pchsale_pchinvt | % change in sales - % change in inventory | Abarbanell & Bushee | 1998, TAR | Compustat | Annual
62 | pchsale_pchrect | % change in sales - % change in A/R | Abarbanell & Bushee | 1998, TAR | Compustat | Annual
63 | pchsale_pchxsga | % change in sales - % change in SG&A | Abarbanell & Bushee | 1998, TAR | Compustat | Annual
64 | pchsaleinv | % change sales-to-inventory | Ou & Penman | 1989, JAE | Compustat | Annual
65 | pctacc | Percent accruals | Hafzalla, Lundholm & Van Winkle | 2011, TAR | Compustat | Annual
66 | pricedelay | Price delay | Hou & Moskowitz | 2005, RFS | CRSP | Monthly
67 | ps | Financial statements score | Piotroski | 2000, JAR | Compustat | Annual
68 | quick | Quick ratio | Ou & Penman | 1989, JAE | Compustat | Annual
69 | rd | R&D increase | Eberhart, Maxwell & Siddique | 2004, JF | Compustat | Annual
70 | rd_mve | R&D to market capitalization | Guo, Lev & Shi | 2006, JBFA | Compustat | Annual
71 | rd_sale | R&D to sales | Guo, Lev & Shi | 2006, JBFA | Compustat | Annual
72 | realestate | Real estate holdings | Tuzel | 2010, RFS | Compustat | Annual
73 | retvol | Return volatility | Ang, Hodrick, Xing & Zhang | 2006, JF | CRSP | Monthly
74 | roaq | Return on assets | Balakrishnan, Bartov & Faurel | 2010, JAE | Compustat | Quarterly
75 | roavol | Earnings volatility | Francis, LaFond, Olsson & Schipper | 2004, TAR | Compustat | Quarterly
76 | roeq | Return on equity | Hou, Xue & Zhang | 2015, RFS | Compustat | Quarterly
77 | roic | Return on invested capital | Brown & Rowe | 2007, WP | Compustat | Annual
78 | rsup | Revenue surprise | Kama | 2009, JBFA | Compustat | Quarterly
79 | salecash | Sales to cash | Ou & Penman | 1989, JAE | Compustat | Annual
80 | saleinv | Sales to inventory | Ou & Penman | 1989, JAE | Compustat | Annual
81 | salerec | Sales to receivables | Ou & Penman | 1989, JAE | Compustat | Annual
82 | secured | Secured debt | Valta | 2016, JFQA | Compustat | Annual
83 | securedind | Secured debt indicator | Valta | 2016, JFQA | Compustat | Annual
84 | sgr | Sales growth | Lakonishok, Shleifer & Vishny | 1994, JF | Compustat | Annual
85 | sin | Sin stocks | Hong & Kacperczyk | 2009, JFE | Compustat | Annual
86 | sp | Sales to price | Barbee, Mukherji & Raines | 1996, FAJ | Compustat | Annual
87 | std_dolvol | Volatility of liquidity (dollar trading volume) | Chordia, Subrahmanyam & Anshuman | 2001, JFE | CRSP | Monthly
88 | std_turn | Volatility of liquidity (share turnover) | Chordia, Subrahmanyam & Anshuman | 2001, JFE | CRSP | Monthly
89 | stdacc | Accrual volatility | Bandyopadhyay, Huang & Wirjanto | 2010, WP | Compustat | Quarterly
90 | stdcf | Cash flow volatility | Huang | 2009, JEF | Compustat | Quarterly
91 | tang | Debt capacity/firm tangibility | Almeida & Campello | 2007, RFS | Compustat | Annual
92 | tb | Tax income to book income | Lev & Nissim | 2004, TAR | Compustat | Annual
93 | turn | Share turnover | Datar, Naik & Radcliffe | 1998, JFM | CRSP | Monthly
94 | zerotrade | Zero trading days | Liu | 2006, JFE | CRSP | Monthly

I asked chatbot arena for a good stock screen using ROE, and got the following responses:

{ROE_Gemini202407}
ROE above 20%
EPS growth above 10%
Debt-to-equity ratio below 0.5
Price above the 50-day moving average
RSI below 70

{ROE_llama-3.1-70b-instruct202407}
ROE Greater than 15%
PE Less than 25
Debt-to-Equity Ratio Less than 0.5
Dividend Yield Greater than 2%
50-Day Moving Average Above 200-Day Moving Average
RSI Between 30 and 70

{ROE_llama-3.1-8b-instruct202407}
ROE Greater than 15%
PE Less than 20
Debt-to-Equity Ratio Less than 1
Dividend Yield Greater than 2%
50-Day Moving Average Above 200-Day Moving Average

These all look reasonable, but they do not outperform the market. General-purpose LLMs (plagiarism) are not useful yet for stock screens. Other ML methods (curve-fitting) might be, but the recent AI excitement is due to the LLM surprises (they work better than expected).
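For anyone who wants to check such a suggested screen against their own data, the criteria translate directly into a filter. This is only a sketch of the {ROE_llama-3.1-8b-instruct202407} rules listed above; the field names (`roe`, `pe`, `ma50`, ...) and the three example stocks are made-up assumptions, not real data.

```python
# The {ROE_llama-3.1-8b-instruct202407} criteria as (description, predicate) pairs.
# Field names are hypothetical; ratios such as ROE are stored as fractions.
SCREEN = [
    ("ROE greater than 15%", lambda s: s["roe"] > 0.15),
    ("PE less than 20", lambda s: s["pe"] < 20),
    ("Debt-to-equity less than 1", lambda s: s["debt_to_equity"] < 1),
    ("Dividend yield greater than 2%", lambda s: s["dividend_yield"] > 0.02),
    ("50-day MA above 200-day MA", lambda s: s["ma50"] > s["ma200"]),
]

def failed_rules(stock):
    """Return the descriptions of every screen rule the stock fails."""
    return [desc for desc, rule in SCREEN if not rule(stock)]

# Hypothetical stocks; every number here is invented for illustration.
stocks = [
    {"ticker": "AAA", "roe": 0.22, "pe": 14, "debt_to_equity": 0.4,
     "dividend_yield": 0.03, "ma50": 52, "ma200": 48},
    {"ticker": "BBB", "roe": 0.18, "pe": 31, "debt_to_equity": 0.7,
     "dividend_yield": 0.025, "ma50": 40, "ma200": 38},
    {"ticker": "CCC", "roe": 0.09, "pe": 11, "debt_to_equity": 1.5,
     "dividend_yield": 0.04, "ma50": 20, "ma200": 25},
]

passed = [s["ticker"] for s in stocks if not failed_rules(s)]
print(passed)
```

Keeping the rules as data rather than hard-coded `if` statements makes it easy to swap in one of the other suggested screens and compare them on the same universe.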

No. of Recommendations: 1

Thank you for that list!
I'm glad someone did it, i.e. threw random forests, neural nets, and all the new stuff at predicting stock returns from fundamentals, as in the "Empirical Asset Pricing via Machine Learning" paper:
https://papers.ssrn.com/sol3/papers.cfm?abstract_i...
But IMHO, the goal seems a little misguided.

The LLM stock screen is amusing.
But I wasn't thinking of asking an existing LLM to provide a stock screen, but of training an "AI" to distinguish "dog" stocks from e.g. "cheetah" stocks.

This is just blue-skying:

I spent a couple of minutes online, so no doubt there's something better I missed, but this one at least exemplifies an ML framework that one can fairly easily train on one's own data
https://developer.apple.com/machine-learning/creat...

An amusing story of a guy using the above Apple thingy to identify his cat, using his rather small "database" of pics of his own cat, from a bunch of online cat pics
https://www.linkedin.com/pulse/finding-my-feline-f...

An initial thought is to translate all financial data such as P/B ratios etc into categories e.g. "very low, low, medium, medium high, high" because LLMs aren't great with numbers/arithmetic. This should be done carefully because e.g. certain sectors tend to have low P/B while others have high, etc, so a little thought put into the categorization could be useful.
After translating numbers into categories, one can then use text AIs.
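A minimal sketch of the sector-aware categorization described above, assuming five equal-sized buckets ranked within each sector. The tickers, sectors and P/B values are invented, and `categorize_within_sector` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

LABELS = ["very low", "low", "medium", "medium high", "high"]

def categorize_within_sector(rows):
    """Map each stock's P/B to one of five labels by its rank within its own sector."""
    by_sector = defaultdict(list)
    for ticker, sector, pb in rows:
        by_sector[sector].append((pb, ticker))
    result = {}
    for members in by_sector.values():
        members.sort()  # ascending P/B within the sector
        n = len(members)
        for rank, (_pb, ticker) in enumerate(members):
            # Spread ranks 0..n-1 evenly over the five labels.
            bucket = min(rank * len(LABELS) // n, len(LABELS) - 1)
            result[ticker] = LABELS[bucket]
    return result

# Hypothetical universe: banks trade at low P/B, tech at high P/B.
rows = [
    ("BANK1", "Financials", 0.7), ("BANK2", "Financials", 0.9),
    ("BANK3", "Financials", 1.0), ("BANK4", "Financials", 1.2),
    ("BANK5", "Financials", 1.6),
    ("TECH1", "Technology", 3.0), ("TECH2", "Technology", 5.5),
    ("TECH3", "Technology", 8.0), ("TECH4", "Technology", 12.0),
    ("TECH5", "Technology", 20.0),
]
categories = categorize_within_sector(rows)
print(categories["BANK5"], "|", categories["TECH1"])
```

In this toy universe a bank at P/B 1.6 is labeled "high" while a tech stock at P/B 3.0 is labeled "very low", which is the sector effect the categorization is meant to respect.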

With the Apple one, or something similar, could use its "text classification" mode, "word tagging" mode, or "tabular" mode. In classification mode: instead of "great movie" or "bad movie" from reading reviews it'd output "great investment" or "bad investment" from reading (categorized) fundamentals etc. In "word tagging" mode: instead of "iphone" it'd tag "good investment", etc.

Cats from dogs?
https://www.zdnet.com/article/what-does-ai-know-of...
I'm not sure you'd want to fake an existing purpose-built image classifier into doing the finance problem, but maybe one could try. Instead, you might have much better success if you build your own AI using some tool like the Apple one to distinguish, from financial features, the dog stocks from the cheetah stocks.

No. of Recommendations: 1

Hello there tedthedog ..... Great thoughts and questions!

Also lizgdal ..... Thanks for the list ... it's interesting.

A couple of comments : QUOTE : "because LLMs aren't great with numbers/arithmetic" ENDQUOTE

This statement is misguided - actually all "AI/ML" tools are in the end numerically encoded (i.e. binary/continuous). LLMs and generative AI are neural networks (transformers, in the case of LLMs), and NNs can't handle anything other than numeric data at the final processing end. What's happening in the intermediate stages is how they have tokenized the text/image components etc. - which might lead you to believe that categorization is the way to go.

However, this is the fundamental truth in all machine learning algorithms - Numeric to Categorization = Loss of Fidelity/Info - i.e. you are ALWAYS better off feeding in the original numerical values (except in some extreme cases where the information is full of noise and a human expert can make better sense of it).

NET: There's no need to recode numeric data into arbitrary categories to feed into MLs. The main thing to remember about such numeric data is that algos will naturally attribute a distance metric to the underlying data, so if for any reason the data is actually ordinal (i.e. the values are ordered but the differences between them are not representative), you are better off using those as Factors (e.g. Zacks Rank or Value Line Score).
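One way to read the point about ordinal data: a 1-to-5 rank can be passed to a model as a factor (here one-hot encoded), so the model does not assume that rank 4 is "twice" rank 2. This is only an illustrative sketch; the ranks below are invented and `one_hot` is a hypothetical helper.

```python
def one_hot(rank, levels=(1, 2, 3, 4, 5)):
    """Encode an ordinal rank as a factor: one indicator column per level."""
    if rank not in levels:
        raise ValueError(f"unknown rank: {rank}")
    return [1 if rank == level else 0 for level in levels]

# Hypothetical analyst-style ranks (1 = best, 5 = worst) for three stocks.
ranks = {"AAA": 1, "BBB": 3, "CCC": 5}
encoded = {ticker: one_hot(rank) for ticker, rank in ranks.items()}
print(encoded["BBB"])  # [0, 0, 1, 0, 0]
```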

Where LLMs can potentially be useful - since they are basically "complete the sentence/thought" engines - is in taking text/language inputs like Twitter/Reddit/filings commentary etc. to see whether the LLMs produce better extracted features from them, which can then be processed as factors.

If you read some of the good books on Deep Learners, you will see that the authors (definitely Hinton's school of thought/disciples) themselves admit that it's only in the "interpretative class" of problems like Image/Text/Thought etc. that Deep Learners are useful. Numerical problems are typically better handled through what I tend to refer to as the "Stat-MLs", i.e. RandomForest and Boosters (XGB, LGBM, CatBoost etc.) or SVMs (if you have a 2-class problem) etc.

One novel use of CNNs could be to actually feed in, say, StockCharts images of stock history and try to predict the future - i.e. total pattern classification. [Basically what all the candlestick patterns etc. try to do with numerical rules.] Very likely it will only help to prove the futility of things.

But there are a lot of attempts to use LSTMs (a specific class of recurrent neural networks) to do this with numerical inputs.

The list of the variables tried tells the story by itself:

(a) The majority of the variables are ANNUAL - which almost means the outcome of the stock performance is 1+-2 years out
(b) One of the key attributes of MLs is the "learning" capability - if the data is not getting updated, the model is stale
(c) Otherwise - this sort of exercise boils down to

"Let me see if there's some SUPER COMPLEX non-linear representation which can fit the data I am feeding it and which can beat a simpler model". The outcomes of these become obvious - they typically won't generalize going forward, and it's just an extreme case of Data Mining

Hope this helps!
Best

No. of Recommendations: 6

Tedthedog: Am curious about the features you used, do they include "fundamental ratios" e.g. P/B, P/S, etc. often used in screens?

Yes, the vast majority of the features being used are fundamental financial data which change at the quarterly reports, plus analysts estimates and then the usual price, volume, momentum and sentiment factors. Have fed the ML models as many as 400 features but removing most correlated features and using a much smaller subset works better.
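A minimal sketch of one common way to remove highly correlated features: walk through the features in order and keep each one only if its absolute Pearson correlation with every already-kept feature stays at or below a cutoff. The 0.9 cutoff, the feature names and the values are assumptions for illustration, not the procedure actually used in the post.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def prune_correlated(features, threshold=0.9):
    """Greedily keep a feature only if |corr| with each kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns, one value per stock.
features = {
    "pb": [0.8, 1.2, 2.5, 3.1, 4.0, 0.9],
    "pb_scaled": [1.6, 2.4, 5.0, 6.2, 8.0, 1.8],  # 2 * pb, so fully redundant
    "mom6m": [0.10, -0.05, 0.20, -0.15, 0.02, 0.30],
}
kept = prune_correlated(features)
print(kept)
```

The result depends on feature order, since the first of two correlated features is the one that survives; ordering features by some prior importance measure is one way to make that choice deliberate.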

Although I started out using my Stock Investor Pro database coupled with scikit-learn, I've put that on the back burner. Portfolio123 is developing an ML service for professional and individual investors. I was invited to join their beta user group, trying out the product and evaluating different models before it is released as a service. The beta user group includes a few well-versed finance professionals with ML experience and some who are essentially beginners. They are developing many more user-friendly tools for use by individual investors.

The system includes most major ML models (9 categories with several subsets within each category). In addition, you can define your own hyperparameters for each model. Because financial data has a very large amount of noise, an ensemble of different learning models seems to achieve better results. Users do not have access to the proprietary data but can select their own features, run models, and run simulations with buy/sell rules, including friction, using the data. Significantly better than I could have accomplished as an individual.

It works! Looks to be an excellent system but not likely to be affordable for small net worth investors.

On my very first try, months ago, I selected 77 fundamental data features of the kind typically used in screens. Trained 6 standard ML models (random forest, extra trees, xgboost, deeptables +) on 3.5 years of S&P mid cap data starting in 2004; the data did not include the stock ID, just the features and the 3-month gain. Starting 6 months after the end of the training data, to avoid look-ahead data leakage, tested the models on 2 years of data. All the models performed better than the universe.
For the second fold, again starting in 2004, trained on 5.5 years of data, skipped 6 months, and used the models to predict future gain for 2 years, with the same models each restarting from scratch and the same features. Again, all retrained models outperformed the universe.
Repeated this 6 more times, giving 16 years of future predictions, all of which achieved better-than-universe results.
Did not have to change the models used or the features selected to achieve significant outperformance.
Outperformed using 20, 40, 100, 200 stocks.
Some of the models tell you which factors they found most valuable.
Some of the newer ML models significantly outperform some of the classics used in my initial try but they all work without changing anything.
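The fold scheme described above (expanding training window from 2004, a 6-month gap, then 2 years of testing, 8 folds in total) can be sketched as below. This is an assumption-laden reconstruction: the post does not give exact dates, so the January 2004 start and back-to-back test windows are guesses, and `walk_forward_folds` is a hypothetical helper.

```python
def walk_forward_folds(start_year=2004, first_train_months=42, step_months=24,
                       gap_months=6, test_months=24, n_folds=8):
    """Expanding-window folds; months are counted from January of start_year (assumed)."""
    def label(month_index):
        return f"{start_year + month_index // 12}-{month_index % 12 + 1:02d}"

    folds = []
    for k in range(n_folds):
        train_end = first_train_months + k * step_months  # exclusive month index
        test_start = train_end + gap_months               # skip the gap to avoid leakage
        test_end = test_start + test_months               # exclusive month index
        folds.append({
            "train": (label(0), label(train_end - 1)),
            "test": (label(test_start), label(test_end - 1)),
        })
    return folds

folds = walk_forward_folds()
for fold in folds:
    print(fold)
```

Under these assumptions the first fold trains on 2004-01 through 2007-06 (3.5 years) and tests on 2008-01 through 2009-12, the second trains through 2009-06 (5.5 years), and the eight test windows together cover 16 years. Each fold's models would be refit from scratch on that fold's training window only.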
