> I can use something like GPT-4 to label data and then use that as a training set for my own LLM, right?
Yes, almost all of the improved Llama models are tuned exactly that way: trained on example questions and answers generated by, say, GPT-4. If OpenAI stole copyrighted works to train their models, it's morally fair game to do the same to them regardless of their TOS. It's not like they can prove it anyway.
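For anyone who wants to try it, the generation step is pretty mechanical. A minimal sketch with the openai v1 Python SDK (the questions, file name, and sampling settings below are just placeholders, not anyone's actual recipe):

```python
# Rough sketch: use GPT-4 to generate synthetic Q&A pairs for fine-tuning a smaller model.
# Assumes the openai v1 Python SDK and OPENAI_API_KEY set in the environment.
import json
from openai import OpenAI

client = OpenAI()

questions = [
    "Explain the difference between supervised and unsupervised learning.",
    "What does gradient clipping do during training?",
]

with open("synthetic_qa.jsonl", "w") as f:
    for q in questions:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": q}],
            temperature=0.7,
        )
        answer = resp.choices[0].message.content
        # One instruction/response pair per line, the usual shape for SFT datasets.
        f.write(json.dumps({"instruction": q, "output": answer}) + "\n")
```

Each line of the resulting JSONL is one instruction/response pair, which is the format most open-source fine-tuning scripts expect.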
Plus there's the other point: they also say that everything generated by their models is public domain, so which one is it, eh?
It's my understanding that this is how "alignment" works.
That is, OpenAI paid people to chat with their LLM to fine-tune it, and then other LLMs use ChatGPT to generate training data to align their models.
Yup, totally. This is a form of knowledge distillation, and OpenAI, or any other foundation model provider, can't really do anything about it.
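Worth noting that "distillation" here is looser than the textbook version. Classic knowledge distillation matches the teacher's logits; with an API-only teacher like GPT-4 you only ever get sampled text, so in practice this is sequence-level distillation (train the student on the teacher's outputs). For contrast, the textbook logit-matching loss looks roughly like this (just a sketch, names are placeholders):

```python
# Sketch of textbook knowledge distillation (logit matching), for contrast.
# With an API-only teacher you never see logits, so "distillation" in this
# thread really means training the student on text sampled from the teacher.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor keeps gradient magnitudes comparable across temperatures
    # (Hinton et al., 2015, "Distilling the Knowledge in a Neural Network").
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```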
Yes, and in fact that's the best method available if you want good performance. I would suggest using a local open-source model for the generation step, though, to cut costs and to avoid dealing with the unwieldy OpenAI systems.
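With a local instruction-tuned checkpoint via transformers, the same labeling loop looks roughly like this (a sketch; the model name is only an example and the prompt format is an assumption, not a recommendation):

```python
# Rough sketch: generate synthetic answers with a local open-source model
# instead of the OpenAI API. Any instruction-tuned model you can run locally
# works similarly; this checkpoint name is just an example.
import json
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example checkpoint
    device_map="auto",
)

questions = ["What does gradient clipping do during training?"]

with open("synthetic_qa_local.jsonl", "w") as f:
    for q in questions:
        prompt = f"Question: {q}\nAnswer:"
        out = generator(
            prompt,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.7,
            return_full_text=False,  # keep only the generated continuation
        )
        answer = out[0]["generated_text"].strip()
        f.write(json.dumps({"instruction": q, "output": answer}) + "\n")
```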
Indeed, fine-tuning with either synthetic data (as you are proposing) or human review works like that. You can read more here: https://huggingface.co/blog/rlhf
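The fine-tuning step on that synthetic data can then be something like the following rough sketch with trl's SFTTrainer (argument names shift between trl versions, so treat this as the shape of the code rather than exact code; the base model and prompt template are placeholders):

```python
# Rough sketch of supervised fine-tuning on the synthetic JSONL produced above.
# Check the trl docs for the exact arguments in your installed version.
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("json", data_files="synthetic_qa.jsonl", split="train")

def to_text(example):
    # Collapse each pair into a single "text" field, the column SFTTrainer
    # trains on by default.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # example base model
    train_dataset=dataset,
)
trainer.train()
```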
Not an AI expert, but from a talk I recently heard: if there is a mismatch in training data between the "teacher" LLM and the "student" LLM, you risk teaching the student to hallucinate or to ignore information.