| Contents |
|---|
| Dataset Description |
| Columns Description |
| Questions for Analysis |
| Data Wrangling |
| Data Cleaning |
| Exploratory Data Analysis |
| Built with |
This dataset contains information about 10,000 movies extracted from TMDB, released between 1960 and 2015, including user ratings and revenue. The original data comes from Kaggle.
- `id`, `imdb_id`: unique ID or IMDb ID for each movie on TMDB.
- `popularity`: a metric used to measure the popularity of the movie.
- `budget`: the total budget of the movie in USD.
- `revenue`: the total revenue of the movie in USD.
- `original_title`: the original title of the movie.
- `cast`: the names of the cast of the movie, separated by "|".
- `homepage`: the website of the movie (if it existed).
- `director`: name(s) of the director(s) of the movie (separated by "|" if there is more than one director).
- `tagline`: a catchphrase describing the movie.
- `keywords`: keywords related to the movie.
- `overview`: summary of the plot of the movie.
- `runtime`: total runtime of the movie in minutes.
- `genres`: genres of the movie, separated by "|".
- `production_companies`: production company or companies of the movie.
- `release_date`: release date of the movie.
- `vote_count`: number of voters for the movie.
- `vote_average`: the average user rating of the movie.
- `release_year`: release year of the movie (from 1960 to 2015).
- `budget_adj`: the total budget of the movie in USD in terms of 2010 dollars, accounting for inflation over time.
- `revenue_adj`: the total revenue of the movie in USD in terms of 2010 dollars, accounting for inflation over time.
- Do movies with high popularity achieve high revenue?
- What are the most filmed genres in this whole dataset?
- Is there a correlation between a movie budget and its revenue?
Our data can be found in the `tmdb-movies.csv` file provided in this repository. It is an edited version of the original Kaggle TMDB 5000 Movie Dataset, provided by Udacity as part of the Become a Data Analyst Nanodegree Program.
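A minimal sketch of how the file might be loaded and inspected with pandas. The inline sample rows are made up so the snippet runs without the CSV, and `load_movies` is a hypothetical helper name, not something taken from the notebook.

```python
import io

import pandas as pd

# Tiny inline stand-in for tmdb-movies.csv (made-up rows, not real data).
SAMPLE_CSV = """id,original_title,budget,revenue
1,Movie A,100,250
2,Movie B,0,0
2,Movie B,0,0
"""


def load_movies(source):
    """Read the CSV and report its shape and number of duplicated rows."""
    df = pd.read_csv(source)
    summary = {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicated_rows": int(df.duplicated().sum()),
    }
    return df, summary


# With the real file this would be: load_movies("tmdb-movies.csv")
df, summary = load_movies(io.StringIO(SAMPLE_CSV))
print(summary)
```

Calling `df.info()` on the loaded frame is a quick way to spot columns with missing values and unexpected data types.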
Main Observations:
- Our dataset consisted of a total of 10866 rows and 21 columns.
- We had only 1 duplicated row, which was dropped.
- Some columns won't be useful in answering our questions, so they were dropped.
- A few columns had many missing values that needed to be handled.
- Columns `cast`, `director`, and `genres` had values separated with a '|'.
- The data type of `release_date` needed to be cast.
- We could append a column for the movie `profit` using the formula: $profit = revenue - budget$.
- `vote_average` is better presented as a categorical variable that groups multiple rating values.
- We might also categorize the `profit` column for better EDA.
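The observations above could translate into pandas roughly as follows. This is a sketch under stated assumptions, not the notebook's actual code: the sample rows are made up, the rating bins and labels are hypothetical, and dropping rows with missing `genres` is only one way to handle missing values.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few raw columns; not real TMDB data.
raw = pd.DataFrame({
    "original_title": ["A", "B", "B", "C"],
    "genres": ["Drama|Comedy", "Action", "Action", None],
    "release_date": ["2015-06-09", "2015-05-13", "2015-05-13", "2015-03-18"],
    "budget": [100, 50, 50, 80],
    "revenue": [300, 20, 20, 80],
    "vote_average": [7.5, 5.1, 5.1, 6.3],
    "homepage": [None, None, None, None],  # example of a column we do not need
})

clean = (
    raw.drop_duplicates()              # remove the duplicated row
       .drop(columns=["homepage"])     # drop columns not needed for the questions
       .dropna(subset=["genres"])      # assumption: rows without genres are dropped
       .reset_index(drop=True)
)

# Cast release_date from text to datetime. The real file may use another
# date layout, in which case an explicit format= argument would be needed.
clean["release_date"] = pd.to_datetime(clean["release_date"])

# Turn the '|'-separated text into Python lists.
clean["genres"] = clean["genres"].str.split("|")

# profit = revenue - budget
clean["profit"] = clean["revenue"] - clean["budget"]

# Hypothetical rating bins: (0, 5] low, (5, 7] medium, (7, 10] high.
clean["rating_level"] = pd.cut(
    clean["vote_average"], bins=[0, 5, 7, 10], labels=["low", "medium", "high"]
)

# Hypothetical two-level profit category.
clean["profit_level"] = np.where(clean["profit"] > 0, "profit", "no profit")

print(clean)
```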
After finishing our dataset cleaning, we ended up with a total of 10840 records and 10 columns. The dataset now has no duplicates or null values, and the data types are consistent, with suitable categorical variables to address our questions. We then performed some analysis and created some visualizations to answer our targeted questions.
More popular movies receive considerably more revenue than less popular movies.
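One way such a comparison might be made is to split movies at the median popularity and compare the mean revenue of the two groups. The numbers below are made up for illustration, and the two-group split and its labels are assumptions rather than the notebook's actual method.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from the dataset.
movies = pd.DataFrame({
    "popularity": [0.2, 0.4, 0.9, 1.5, 3.0, 8.0],
    "revenue": [10, 30, 50, 120, 400, 900],
})

# Split movies into two equally sized groups around the median popularity.
movies["popularity_level"] = pd.qcut(
    movies["popularity"], q=2, labels=["less popular", "more popular"]
)

# Mean revenue per popularity group.
mean_revenue = movies.groupby("popularity_level", observed=True)["revenue"].mean()
print(mean_revenue)
```

Using more quantiles (for example `q=4`) would give a finer picture of how revenue changes with popularity.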
`Drama`, `Comedy`, and `Action` are the three most filmed genres across the 10839 movies in our dataset. The `Drama` genre alone is filmed 22.6% of the time in our dataset.
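Counting genres when each movie lists several of them separated by '|' might look like the sketch below. The rows are made up, and the percentage here is computed over all genre labels, which is an assumption: the text does not say which denominator was used for the figure above.

```python
import pandas as pd

# Made-up rows for illustration only.
movies = pd.DataFrame({
    "genres": ["Drama|Comedy", "Action", "Drama", "Action|Drama|Thriller"],
})

# One row per (movie, genre) pair.
genre_per_row = movies["genres"].str.split("|").explode()

# How many times each genre appears, most frequent first.
counts = genre_per_row.value_counts()

# Share of each genre among all genre labels, in percent.
share = (genre_per_row.value_counts(normalize=True) * 100).round(1)

print(counts)
print(share)
```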
There is a positive correlation between `budget` and `revenue`, indicating a relationship between them, with few outliers found.
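A correlation like this can be checked with `Series.corr`, which computes the Pearson coefficient by default. The values below are made up for illustration and do not reproduce the dataset's actual coefficient.

```python
import pandas as pd

# Made-up numbers for illustration only.
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [15, 45, 50, 95, 120],
})

# Pearson correlation coefficient between budget and revenue.
r = movies["budget"].corr(movies["revenue"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 points to a strong positive linear relationship. A scatter plot of the two columns is a useful companion for spotting outliers.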
- JupyterLab
- Python 3
- Pandas
- NumPy