Chapter 4 Missing values

4.1 Reasons for analysing missing values

To make informed decisions with the data, we first had to understand which parts of it were missing so that we could handle it appropriately. We analysed missing data in the Datafiniti dataset to determine which variables we could use to answer our questions about vegan/vegetarian restaurants. Understanding the overall structure of the data shows how much of it is missing, how much is present, and where the gaps fall.

4.2 Visual representation of missing values in dataset

In this dataset, 38.3% of all values are missing and 61.7% are present. Most of the values that are present relate to location, address, region, menus, and cuisines. Most of the missing values relate to restaurant hours of operation, dress attire, and other business features.

The number of missing values in each column is listed below:

##   descriptions.dateSeen descriptions.sourceURLs      descriptions.value 
##                   10000                   10000                   10000 
##            features.key          features.value               hours.day 
##                   10000                   10000                   10000 
##              hours.dept              hours.hour         languagesSpoken 
##                   10000                   10000                   10000 
##                isClosed              yearOpened                     sic 
##                    9963                    9909                    9860 
##                 claimed         facebookPageURL                 twitter 
##                    9311                    9063                    8042 
##            paymentTypes       menus.description               imageURLs 
##                    6127                    6013                    4866 
##          menus.category      priceRangeCurrency           priceRangeMin 
##                    4070                    3673                    3673 
##           priceRangeMax                websites          menus.currency 
##                    3673                    1817                      53 
##                      id               dateAdded             dateUpdated 
##                       0                       0                       0 
##                 address              categories       primaryCategories 
##                       0                       0                       0 
##                    city                 country                cuisines 
##                       0                       0                       0 
##                    keys                latitude               longitude 
##                       0                       0                       0 
##             menuPageURL         menus.amountMax         menus.amountMin 
##                       0                       0                       0 
##          menus.dateSeen              menus.name        menus.sourceURLs 
##                       0                       0                       0 
##                    name                  phones              postalCode 
##                       0                       0                       0 
##                province              sourceURLs 
##                       0                       0
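
As a cross-check on the overall figure, the per-column counts above can be summed and divided by the total number of cells. The sketch below is in Python and is not the code that produced the listing. It assumes the dataset has exactly 10000 rows, which is inferred from the columns with 10000 missing values rather than stated in the text.

```python
# Non-zero missing counts copied from the listing above.
nonzero_missing = {
    "descriptions.dateSeen": 10000, "descriptions.sourceURLs": 10000,
    "descriptions.value": 10000, "features.key": 10000,
    "features.value": 10000, "hours.day": 10000,
    "hours.dept": 10000, "hours.hour": 10000,
    "languagesSpoken": 10000, "isClosed": 9963,
    "yearOpened": 9909, "sic": 9860,
    "claimed": 9311, "facebookPageURL": 9063,
    "twitter": 8042, "paymentTypes": 6127,
    "menus.description": 6013, "imageURLs": 4866,
    "menus.category": 4070, "priceRangeCurrency": 3673,
    "priceRangeMin": 3673, "priceRangeMax": 3673,
    "websites": 1817, "menus.currency": 53,
}

n_columns = 47   # 24 columns with missing values plus 23 columns with none
n_rows = 10000   # assumption: inferred from the fully missing columns

total_missing = sum(nonzero_missing.values())
pct_missing = 100 * total_missing / (n_rows * n_columns)

print(f"{pct_missing:.1f}% missing, {100 - pct_missing:.1f}% present")
# prints "38.3% missing, 61.7% present"
```

Under that row-count assumption, the result agrees with the 38.3% and 61.7% figures reported in this section.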

4.3 Missing data for every city

We examined missing data by city. The majority of the missing values come from observations in New York City, and observations from Brooklyn have the second-highest number of missing values.
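
A per-city tally of this kind can be sketched as follows. The records, field values, and the `missing_by_city` helper are hypothetical and only illustrate the grouping; they are not rows from the Datafiniti dataset or the code used for this analysis. Because a city with more observations will naturally accumulate more missing cells, the sketch also reports a missing rate per city.

```python
from collections import Counter

# Hypothetical records for illustration only (not Datafiniti rows).
records = [
    {"city": "New York", "websites": None, "twitter": None},
    {"city": "New York", "websites": "example.com", "twitter": None},
    {"city": "Brooklyn", "websites": None, "twitter": "@example"},
    {"city": "Chicago", "websites": "example.org", "twitter": "@example2"},
]


def missing_by_city(rows, key="city"):
    """Count missing (None) cells per city across all fields except the key."""
    counts = Counter()
    for row in rows:
        counts[row[key]] += sum(
            1 for name, value in row.items() if name != key and value is None
        )
    return counts


city_counts = missing_by_city(records)
rows_per_city = Counter(row["city"] for row in records)
n_fields = len(records[0]) - 1  # fields other than "city"

for city, n_missing in city_counts.most_common():
    rate = n_missing / (rows_per_city[city] * n_fields)
    print(f"{city}: {n_missing} missing cells, rate {rate:.2f}")
```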

4.4 Analysis of missing data by row count and percentage

4.4.1 Analysis of missing data by row count

4.4.2 Analysis of missing data by percentage

Based on the resulting maps of missing data, the first 9 variables listed on the map are all missing in the same rows, which suggests that their missingness may be correlated. These variables appear to be missing in 100% of the rows. The first missing data pattern displayed on the map accounts for just over 60% of the rows in the data, and the second accounts for just over 35%. The dataset contains no complete cases: every row has at least one missing value.
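
The idea of a missing data pattern can be made concrete with a small sketch: each row is reduced to a tuple of missing/present flags, and identical tuples are counted. The rows and the `missing_patterns` helper below are hypothetical and are not the data or code behind the maps discussed above.

```python
from collections import Counter

# Hypothetical rows for illustration only (not Datafiniti rows).
columns = ["hours.day", "twitter", "name"]
rows = [
    {"hours.day": None, "twitter": None, "name": "A"},
    {"hours.day": None, "twitter": None, "name": "B"},
    {"hours.day": None, "twitter": "@c", "name": "C"},
    {"hours.day": None, "twitter": None, "name": "D"},
]


def missing_patterns(rows, columns):
    """Count rows per pattern; a pattern is a tuple of is-missing flags."""
    return Counter(
        tuple(row.get(col) is None for col in columns) for row in rows
    )


patterns = missing_patterns(rows, columns)

for pattern, n in patterns.most_common():
    missing_cols = [c for c, is_missing in zip(columns, pattern) if is_missing]
    print(f"{100 * n / len(rows):.0f}% of rows are missing {missing_cols}")

# A complete case is a row whose pattern has no missing flags at all.
complete_cases = patterns[(False,) * len(columns)]
print("complete cases:", complete_cases)
```

In this toy data, one column is missing in every row, so no row can be a complete case. The same reasoning applies to any dataset in which at least one column is entirely missing.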