That explanation makes no fucking sense and makes them look like they know fuck all about AI training.
The output keywords have nothing to do with the training data. If the model in use has fuck all BME training data, it will struggle to draw a BME person regardless of what keywords are used.
And any AI person training their algorithms on AI generated data is liable to get fired. That is a big no-no. Not only does it not provide any new information from the data, it also amplifies the mistakes made by the AI.
There are 2 problems with not having enough diversity in training data:
1. The AI will be worse at depicting diversity when prompted, e.g. if the AI hasn’t seen enough pictures of black people, it may not be able to depict black hair properly, as it doesn’t “know what it looks like”.
2. The AI will not show as much diversity when not prompted. The AI is working off statistics, so if you tell it to depict a person and most of the people it’s “seen” are white men, it will almost always depict a white man, because that’s statistically what a person is according to its data.
This method combats the second problem, but not the first. The first can mostly be solved by scaling up the training data in general, though, which is mostly what these companies have been doing. Even if only 1% of your images are of POC, with 1 billion images that’s 10 million images of POC, which may be enough to train on. Scaling alone leaves the second problem unsolved, though, since the AI will always go with the statistically safe 99%.
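To put rough numbers on that, here is a quick sketch in Python. The function names are made up for illustration, and `expected_unprompted` rests on a big simplifying assumption: that an unprompted model simply mirrors its training shares. Real models can skew even harder toward the majority.

```python
def subgroup_images(total_images: int, share: float) -> int:
    """Absolute number of training images that show the subgroup."""
    return round(total_images * share)


def expected_unprompted(outputs: int, share: float) -> float:
    """Expected number of unprompted outputs showing the subgroup.

    Assumption (not from any real model): outputs mirror training shares.
    """
    return outputs * share


if __name__ == "__main__":
    share = 0.01  # the 1% figure from the comment above
    for total in (1_000_000, 1_000_000_000):
        count = subgroup_images(total, share)
        hits = expected_unprompted(100, share)
        print(f"{total} images -> {count} of the subgroup, "
              f"about {hits:.0f} in 100 unprompted outputs")
```

The absolute count grows with the dataset, which helps with problem 1, but the share stays at 1% no matter how big the dataset gets, which is why problem 2 doesn't go away on its own.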
“any AI person training their algorithms on AI generated data is liable to get fired”
Though this isn’t pertinent to the post in question, training AI (and by AI I presume you mean neural networks, since there’s a fairly important distinction) on AI-generated data is absolutely a part of machine learning.
Some of the most famous neural networks out there are trained on data that they’ve generated themselves, e.g. AlphaGo Zero, which learned from its own self-play games.
They could try to compensate for the imbalance by explicitly asking for the less-represented classes in the data… It’s an idea, not quite bad but not quite good either, because of the problems you mentioned.
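For what it's worth, "explicitly asking" could look something like the sketch below. This is purely hypothetical: `augment_prompt`, the descriptor list, and the skip-if-already-specified rule are all assumptions for illustration, not how any actual product does it.

```python
import random
from typing import Sequence


def augment_prompt(prompt: str, descriptors: Sequence[str],
                   rng: random.Random) -> str:
    """Append one randomly chosen descriptor to the prompt.

    Hypothetical rule: if the user already named one of the descriptors,
    leave the prompt alone so their request is not overridden.
    """
    lowered = prompt.lower()
    if any(d.lower() in lowered for d in descriptors):
        return prompt
    return f"{prompt}, {rng.choice(list(descriptors))}"


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs match
    classes = ["class A", "class B", "class C"]  # placeholder labels
    for _ in range(3):
        print(augment_prompt("a portrait of a doctor", classes, rng))
    # A prompt that already names a class is passed through unchanged.
    print(augment_prompt("a portrait of a class B doctor", classes, rng))
```

Picking uniformly evens out what gets requested, but it does nothing about how well the model can actually draw a class it has barely seen, which is the "not quite good" part.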