In the UBS SP500 example, the strategy selects the FANG stocks. Candidates could do the same: check which stocks performed best during these years and run their strategies on them. This may generate good results in all zooms, but it is meaningless, since it effectively uses “future data”, even if this does not show in their code. So I want to ask how you solve this problem.
Hello Jason,
Thanks for your question. I agree with you that selection and survivorship biases are an issue. After every submission, there will be a review phase where we look at the code for certain biases in the logic. Please be assured that any hardcoding of stock names (such as FANG) or dates will be spotted and sent back to the user for revision. The payout code should only select stocks based on their features (stock names are not a feature).
Other tests that we do include a test for data leakage, where we check if the code is training and predicting values using the same/overlapping dataset.
This is also why we encourage users to submit their model early, so that they will have time to revise it before the deadline is reached.
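As a rough illustration of the overlap test mentioned above, here is a minimal Python sketch. It is not the review team's actual check; the function name `periods_overlap` and the date ranges are made up for the example.

```python
from datetime import date


def periods_overlap(train_start, train_end, test_start, test_end):
    """Return True if the two inclusive date ranges share at least one day."""
    return train_start <= test_end and test_start <= train_end


# Hypothetical date ranges, chosen only for illustration.
train = (date(2006, 1, 1), date(2012, 12, 31))
clean_test = (date(2013, 1, 1), date(2017, 12, 31))
leaky_test = (date(2012, 6, 1), date(2017, 12, 31))

print(periods_overlap(*train, *clean_test))  # False: prediction starts after training ends
print(periods_overlap(*train, *leaky_test))  # True: half of 2012 is in both sets
```

A model that trains and predicts on overlapping ranges will look better in a backtest than it can be expected to look on unseen data.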
I am wondering, to what extent should we use the “future data”?
Suppose I use a function that takes in a time span and runs a selection process (nothing hard-coded) that outputs a list of stocks that performed well. I then assume these stocks will continue to perform well in the future, so I hold all of them from beginning to end with equal weights. Is that a misuse of information?
In other words, can we safely regard the historical price data from 2006-2017 as a training dataset?
Hello - you should not use ‘future’ data; this is what is called look-ahead bias. Please do more research on this and have a look at this page: https://wiki.alphien.com/ALwiki/Preventing_look_ahead_bias_in_payout .
That’s part of the challenge you are facing: creating a good strategy that can work in the future!
Happy coding! Lionel.
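To make the difference concrete, here is a small Python sketch of a selection step that only sees prices strictly before the decision date. It is an assumption-laden toy, not the method from the wiki page: the function `select_top`, the tickers and the price series are all hypothetical.

```python
def select_top(prices, as_of, lookback, n):
    """Rank tickers by return over the `lookback` observations that end
    strictly before index `as_of`, and return the best `n` tickers."""
    if lookback < 2 or as_of < lookback:
        raise ValueError("not enough history before the decision date")
    scores = {}
    for ticker, series in prices.items():
        window = series[as_of - lookback:as_of]  # nothing at or after as_of
        scores[ticker] = window[-1] / window[0] - 1.0
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy price series (made-up numbers, one value per period).
prices = {
    "A": [10, 11, 12, 13, 9, 8],
    "B": [10, 10, 10, 10, 14, 18],
    "C": [10, 9, 8, 7, 7, 7],
}

# Decision taken at period 4, using periods 0-3 only.
print(select_top(prices, as_of=4, lookback=4, n=1))  # ['A']

# Selection over the whole sample: this is the hindsight pick.
print(select_top(prices, as_of=6, lookback=6, n=1))  # ['B']
```

Holding "B" from period 0 because it comes out best over the full sample is exactly the misuse described in the question: at period 4 nothing in the data pointed to it. A selection function is only safe if, at each decision date, it is given data that ends before that date.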