There are multiple ways of storing video.
Some of them just start with a few parameters about frame size and frame rate and whether they are black and white or color, and then have nothing but a continuous sequence of pixel values that can be read one frame at a time and displayed without decompression or decoding. Or the values might be stored as YCrCb instead of RGB, to the same effect. Or the pixels might be RGBG (<https://en.wikipedia.org/wiki/Bayer_filter Bayer filter>). As long as the encoder and decoder agree, the details of how the pixels are stored uncompressed do not matter much.
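As a rough illustration of that kind of uncompressed storage, here is a minimal Python sketch. The header layout (four 16-bit little-endian fields) and the name `read_raw_video` are assumptions made up for this example, not any real file format.

```python
import struct
from io import BytesIO

# Hypothetical header (an assumption, not a real file format):
# width, height, frame rate, channels per pixel, each stored as an
# unsigned 16-bit little-endian integer.
HEADER = struct.Struct("<HHHH")

def read_raw_video(stream):
    """Return (header_dict, list_of_frames) from an uncompressed stream."""
    width, height, fps, channels = HEADER.unpack(stream.read(HEADER.size))
    frame_size = width * height * channels  # one byte per sample
    frames = []
    while True:
        frame = stream.read(frame_size)
        if not frame or len(frame) < frame_size:
            break  # end of data, or a truncated final frame
        frames.append(frame)
    header = {"width": width, "height": height, "fps": fps, "channels": channels}
    return header, frames

# Two 2x2 RGB frames: 12 bytes each, stored back to back after the header.
data = HEADER.pack(2, 2, 30, 3) + bytes(range(12)) + bytes(range(12, 24))
header, frames = read_raw_video(BytesIO(data))
```

Each entry of `frames` can be handed straight to a display routine. No decoding step is involved beyond knowing the frame size.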
Modern video files like .avi are really "container files": ways to store and index a collection of chunks, with each chunk recording what kind of chunk it is and how large it is. A file like an .avi can include a number of different kinds of chunks, including chunks for subtitles, chunks for audio, chunks for video, and chunks for video at a different resolution (think Blu-ray or DVD, which has to be able to handle multiple resolution demands). These days, the content of each chunk is typically defined by some other standards organization, so, for example, it would generally be valid (but not typical) for a video file to represent frames as a series of .jpeg files copied in without change.
(In turn, .jpeg files are internally represented as a series of chunks, with the layout of the chunks given by the JFIF format. The DCT components are one of the possible chunks within that file, and the standard for how those are stored is separate from the JFIF file format commonly known as .jpeg files. The standard for representing DCT blocks gets re-used in other standards.)
The decoder looks for chunks identified as kinds of video that it prefers or at least knows how to decode, and ignores the rest.
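A minimal sketch of that "take what you understand, skip the rest" behaviour is below. It assumes a flat sequence of RIFF-style chunks (4-byte identifier, 4-byte little-endian size, payload padded to an even length). Real .avi files nest chunks inside lists and use their own identifiers, so the IDs `vid0`, `aud0` and `sub0` and the helper names here are hypothetical.

```python
import struct
from io import BytesIO

def make_chunk(chunk_id, payload):
    """Build one chunk: 4-byte id, 4-byte size, payload, pad to even length."""
    pad = b"\x00" if len(payload) % 2 else b""
    return struct.pack("<4sI", chunk_id, len(payload)) + payload + pad

def iter_chunks(stream):
    """Yield (chunk_id, payload) for each chunk in a flat chunk sequence."""
    while True:
        head = stream.read(8)
        if len(head) < 8:
            return  # no more complete chunk headers
        chunk_id, size = struct.unpack("<4sI", head)
        payload = stream.read(size)
        if size % 2:
            stream.read(1)  # skip the padding byte
        yield chunk_id, payload

def select_video(stream, known=(b"vid0",)):
    """Keep payloads of chunk kinds the decoder knows; ignore the rest."""
    return [payload for cid, payload in iter_chunks(stream) if cid in known]

# Hypothetical file: subtitles, video, audio, video.
data = (make_chunk(b"sub0", b"hi!") + make_chunk(b"vid0", b"frame-1")
        + make_chunk(b"aud0", b"pcm") + make_chunk(b"vid0", b"frame-2"))
video_payloads = select_video(BytesIO(data))
```

Because every chunk carries its own size, the reader can step over chunk kinds it has never heard of without understanding their contents.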
One of the chunk formats for representing video is the standard H.264 / MPEG-4 AVC, which is "a block-oriented motion-compensation-based video compression standard. As of 2014 it is one of the most commonly used formats for the recording, compression, and distribution of video content". Notice that it is block oriented: a block is a portion of a frame whose pixels are treated as moving or changing together. H.264 and similar standards divide the raw video up into frames, and then have algorithms for figuring out how to efficiently encode the change between frames, and record that change in the data stream.
H.264 and the like have no concept of shots or scenes, just of frames and blocks within frames. When a video file such as an .avi file is being created, it is possible that someone will also add chunks that identify "titles" and "chapters" and perhaps "scenes". Those are not required; they are just nice to have so that they can be presented on menus.
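To make "block-oriented motion compensation" concrete, here is a toy Python sketch of exhaustive block matching: for one block of the current frame, it searches nearby positions in the previous frame for the best match and reports the displacement. This is only an illustration of the general idea, not the H.264 algorithm (which uses variable block sizes, sub-pixel motion and much more). The function names, block size and search range are assumptions chosen for the example.

```python
def sad(prev, cur, bx, by, dx, dy, n):
    """Sum of absolute differences between the n-by-n block of `cur` at
    (bx, by) and the block of `prev` displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - prev[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))

def best_motion_vector(prev, cur, bx, by, n=4, search=2):
    """Return (cost, dx, dy) for the best-matching block in `prev`."""
    h, w = len(prev), len(prev[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the previous frame.
            if bx + dx < 0 or by + dy < 0 or bx + dx + n > w or by + dy + n > h:
                continue
            cost = sad(prev, cur, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

# 8x8 grayscale test frames: `cur` is `prev` shifted right by one pixel.
prev = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[prev[y][x - 1] if x >= 1 else 0 for x in range(8)] for y in range(8)]
vector = best_motion_vector(prev, cur, 2, 2)  # → (0, -1, 0)
```

The result says the block at (2, 2) is found one pixel to the left in the previous frame with zero error, so an encoder working this way would only need to store the vector rather than the pixels.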
H.264 and similar tend to encode into multiple kinds of frames: https://en.wikipedia.org/wiki/Video_compression_picture_types
- I‑frames are the least compressible but don't require other video frames to decode.
- P‑frames can use data from previous frames to decompress and are more compressible than I‑frames.
- B‑frames can use both previous and forward frames for data reference to get the highest amount of data compression.
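The dependencies between those frame types can be sketched in a few lines of Python. This uses a deliberately simplified model, stated here as an assumption: a P-frame refers only to the nearest earlier I- or P-frame, and a B-frame refers to the nearest I- or P-frame on each side. Real H.264 streams allow more flexible reference choices, and `frame_references` is a name invented for this example.

```python
def frame_references(types):
    """Map each frame index to the frame indices it refers to, under the
    simplified model described above. `types` is a string such as "IBBP"."""
    anchors = [i for i, t in enumerate(types) if t in ("I", "P")]
    refs = {}
    for i, t in enumerate(types):
        prev_anchor = max((a for a in anchors if a < i), default=None)
        next_anchor = min((a for a in anchors if a > i), default=None)
        if t == "I":
            refs[i] = []  # decodable on its own
        elif t == "P":
            refs[i] = [prev_anchor] if prev_anchor is not None else []
        else:  # "B": may look both backward and forward
            refs[i] = [a for a in (prev_anchor, next_anchor) if a is not None]
    return refs

refs = frame_references("IBBP")  # → {0: [], 1: [0, 3], 2: [0, 3], 3: [0]}
```

Note that the two B-frames need frame 3 before they can be decoded, which is why frames are often stored in a different order from the order in which they are displayed.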
Again, no "shot" or "scene" at all.
It is possible that there is a video format out there which does divide up into shots and scenes (or scenes and shots), but I cannot think of any such format in common use.