Hey I work on things like this! My time to shine....
I have seen proposals for this, and although I haven't seen them deployed, they are entirely feasible. The field is called computer vision: the study of how visual data is captured and then processed intelligently. Essentially, the aim is for a camera system to not just gather images but to infer context and information from them.

When a camera captures an image, it produces a large array or matrix of numeric color data. Computational methods can observe changes in this numeric data and pick out trends. This is where machine learning comes in: statistical methods and error/cost analysis are used in tandem with a large set of known data to generate predictions. If you have heard of "neural nets," this is a direct application for them.

These models are trained on large collections of already-labeled examples called data sets. In your case, that would be a bunch of images of people in their cars, each with a corresponding label saying whether their seat belt is on or off, or whether they are on their phone. This data is used to train (or "learn") a model. Given enough good data, prepared in the right ways, the system can decide based on prior knowledge whether a person is on their phone. It gets a lot more complex in practice (there are many steps to building a successful system), but that's the gist. Essentially, a street camera gathers the raw data and offloads it to either a networked device or a local device that computes the prediction.
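To make that concrete, here's a toy sketch (nothing like a production system, and the data is made up) of the core idea: an "image" is just a matrix of numbers, and a model is trained on labeled examples with a cost function and gradient descent until it can predict the label for a new image. Real systems use deep neural nets on actual photos; this just shows the training loop in miniature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 8x8 grayscale "images" flattened to 64-element vectors.
# Pretend class 1 images are brighter on average than class 0 (a stand-in
# for "on phone" vs "not on phone" -- real data would be labeled photos).
n = 200
X0 = rng.normal(0.3, 0.1, size=(n, 64))   # label 0 examples
X1 = rng.normal(0.7, 0.1, size=(n, 64))   # label 1 examples
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic regression trained by gradient descent: the "statistical methods
# and error/cost analysis" mentioned above, at its absolute simplest.
w = np.zeros(64)
b = 0.0
lr = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)         # gradient of the cross-entropy cost
    grad_b = np.mean(p - y)
    w -= lr * grad_w                        # step downhill on the cost
    b -= lr * grad_b

# A new, unseen bright image should now be predicted as class 1.
new_image = np.full(64, 0.7)
pred = 1.0 / (1.0 + np.exp(-(new_image @ w + b)))
print(pred > 0.5)
```

A real seat-belt or phone detector would swap the toy vectors for camera frames and the logistic model for a convolutional neural net, but the train-on-labels, predict-on-new-data loop is the same.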
If you're curious, I can expand on any of the above. Also, check this out: https://github.com/tensorflow/models/tree/master/research/object_detection
It's a common framework that I used for traffic sign detection/recognition through cameras.