Wandering Generalities
The individual clips are remarkable, even jaw-dropping, as demonstrations of the technology, but how useful they are depends on whether you understand implicit versus explicit shot generation.
If you ask SORA for a long tracking shot in a kitchen with a banana on a table, it will rely on its implicit understanding of ‘banana-ness’ to generate a video showing a banana.
Through its training data, it has ‘learnt’ the implicit aspects of banana-ness, such as ‘yellow’, ‘bent’ and ‘has dark ends’. It holds no actual recorded images of bananas and no ‘banana stock library’ database; instead it has a much smaller, compressed hidden representation, a ‘latent space’, of what a banana is.
Every time it runs, it shows another interpretation of that latent space. Your prompt relies on an implicit understanding of banana-ness.
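The idea can be sketched with a deliberately crude toy analogy. This is not SORA's architecture (its real latent space is a learned high-dimensional representation, not a table of words); the attribute lists and function below are invented purely to illustrate how one compressed description can decode into a different concrete banana on every run.

```python
import random

# Toy stand-in for a latent space: a compressed description of
# banana-ness, not a library of stored banana images.
# (Illustrative only; real video models learn continuous,
# high-dimensional representations from data.)
BANANA_LATENT = {
    "colour": ["yellow", "green-tinged yellow", "speckled yellow"],
    "shape": ["gently curved", "sharply bent"],
    "ends": ["dark-tipped", "brown-stemmed"],
}

def decode_banana(seed=None):
    """Sample one concrete interpretation of banana-ness."""
    rng = random.Random(seed)
    return ", ".join(rng.choice(options) for options in BANANA_LATENT.values())

# Each run (each seed) yields a different banana drawn from the
# same compressed description.
print(decode_banana(seed=1))
print(decode_banana(seed=2))
```

The point of the toy is the asymmetry: the description is tiny, but the space of bananas it can decode into is large, and the model chooses a fresh point in that space each time it generates.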
“I have that within which passes show / these just the trappings and the suits of banana.”