I went to Falls Creek for a ski trip before the Master's course commenced. During my three-day, two-night stay, the weather was pretty bad, so I wanted to understand more about the data structure of weather records from the Bureau of Meteorology. I downloaded the data file “IDCJDW3027.201907.csv” (which holds the July 2019 weather records for Falls Creek) and read it into R.
My data set includes:
I read the file into R as a data frame with the readr package and saved it as “fallsCreekJulyWeather”. The first 5 rows of the file are plain-text information, which is irrelevant and should not be read in. Row 6 is the column header. When read into R, the °C symbol scrambled the “Min Temp” and “Max Temp” headers, so I skipped that row as well and specified the column names in the col_names argument instead.
Some columns contain no data, such as Evaporation (mm), Sunshine (hours) and 9am cloud amount (oktas). I skipped those columns by specifying “-” in the col_types argument. I specified the following attributes in the col_types argument to read in all the fields properly:
There are only 31 rows in the dataset, so I used the head function with 31 as the second argument to check the entire read-in output. Below is the associated R code:
# Load readr, which provides read_csv
library(readr)
# Use read_csv to read in the file IDCJDW3027.201907.csv; the first 6
# rows are skipped.
fallsCreekJulyWeather <- read_csv("src/R/data/IDCJDW3027.201907.csv", skip = 6, col_types = "-cddd--ficdi-fc-di-fc-",
col_names = c("Date", "Min Temp", "Max Temp", "Rainfall (mm)", "Direction of maximum wind Gust", "Speed of maximum wind gust (km/h)", "Time of maximum wind gust", "9am Temperature", "9am relative humidity", "9am wind direction", "9am wind speed (km/h)", "3pm Temperature", "3pm relative humidity", "3pm wind direction", "3pm wind speed (km/h)"))
# Check whether fallsCreekJulyWeather is a data frame
is.data.frame(fallsCreekJulyWeather)
## [1] TRUE
# View the read-in output
head(fallsCreekJulyWeather, 31)
## # A tibble: 31 x 15
## Date `Min Temp` `Max Temp` `Rainfall (mm)` `Direction of m~ `Speed of maxim~
## <chr> <dbl> <dbl> <dbl> <fct> <int>
## 1 1/7/~ -3.7 -0.8 1.6 NW 24
## 2 2/7/~ -4.2 -0.2 0 NNW 11
## 3 3/7/~ -3 1 0.8 SE 24
## 4 4/7/~ -2.3 4.3 0.2 ESE 19
## 5 5/7/~ -0.7 6.1 0 NNE 28
## 6 6/7/~ 1.3 6 0 NNW 37
## 7 7/7/~ 1.5 3.5 0 N 61
## 8 8/7/~ 1.5 2 17.6 N 52
## 9 9/7/~ -2.5 -0.6 3.4 NNW 54
## 10 10/7~ -2.1 -0.1 8.6 NW 43
## # ... with 21 more rows, and 9 more variables: `Time of maximum wind
## # gust` <chr>, `9am Temperature` <dbl>, `9am relative humidity` <int>, `9am
## # wind direction` <fct>, `9am wind speed (km/h)` <chr>, `3pm
## # Temperature` <dbl>, `3pm relative humidity` <int>, `3pm wind
## # direction` <fct>, `3pm wind speed (km/h)` <chr>
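As a rough cross-check of the col_types string used above, the sketch below decodes each one-letter code in base R and counts how many columns are kept. The lookup table `code_meaning` is my own helper for illustration (it is not part of readr) and it only covers the codes that appear in this report.

```r
# Hypothetical lookup table: the readr compact codes used in this report.
code_meaning <- c("-" = "skip", c = "character", d = "double",
                  i = "integer", f = "factor")

# The col_types string passed to read_csv in this report.
spec <- "-cddd--ficdi-fc-di-fc-"

# Split the string into single characters and translate each one.
codes   <- strsplit(spec, "")[[1]]
decoded <- unname(code_meaning[codes])

length(codes)           # 22 columns in the raw file
sum(decoded != "skip")  # 15 columns kept, one per entry in col_names
table(decoded)          # how many columns of each type
```

The number of kept columns has to equal the number of names given in col_names, so a count like this is a cheap way to catch a miscounted “-”.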
Inspect the data frame and variables using R functions:
I used the dim function to check the dimensions of the data frame.
I used the str and attributes functions to check the attributes and structure of the data frame.
There are 3 similar factor variables: Direction of maximum wind Gust, 9am wind direction and 3pm wind direction.
All of these factors classify wind direction. To cover every possible direction, there should be four cardinal directions <North - N, East - E, South - S, West - W>, four intercardinal directions <NE, SE, SW, NW> and eight further divisions <NNE, ENE, ESE, SSE, SSW, WSW, WNW, NNW>. I specified all of these as levels and ordered them clockwise, starting from North - N.
The column names in the data frame were already set in the previous section. The row names run from 1 to 31, which makes the dates of July self-explanatory, so I did not alter them any further.
Below is my R code with outputs.
# Dimension
dim(fallsCreekJulyWeather)
## [1] 31 15
# Structure
str(fallsCreekJulyWeather)
## tibble [31 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Date : chr [1:31] "1/7/2019" "2/7/2019" "3/7/2019" "4/7/2019" ...
## $ Min Temp : num [1:31] -3.7 -4.2 -3 -2.3 -0.7 1.3 1.5 1.5 -2.5 -2.1 ...
## $ Max Temp : num [1:31] -0.8 -0.2 1 4.3 6.1 6 3.5 2 -0.6 -0.1 ...
## $ Rainfall (mm) : num [1:31] 1.6 0 0.8 0.2 0 0 0 17.6 3.4 8.6 ...
## $ Direction of maximum wind Gust : Factor w/ 11 levels "NW","NNW","SE",..: 1 2 3 4 5 2 6 6 2 1 ...
## $ Speed of maximum wind gust (km/h): int [1:31] 24 11 24 19 28 37 61 52 54 43 ...
## $ Time of maximum wind gust : chr [1:31] "0:03" "1:43" "18:22" "20:28" ...
## $ 9am Temperature : num [1:31] -3.1 -3 -1.7 1 2.6 3.4 2.2 1.7 -2.1 -0.7 ...
## $ 9am relative humidity : int [1:31] 97 97 98 91 85 75 96 99 97 99 ...
## $ 9am wind direction : Factor w/ 7 levels "NW","NNW","WNW",..: 1 2 NA 3 4 4 2 3 5 2 ...
## $ 9am wind speed (km/h) : chr [1:31] "7" "6" "Calm" "6" ...
## $ 3pm Temperature : num [1:31] -1.6 -0.7 0.1 3.3 5.6 5 2.2 0.9 -0.9 -1 ...
## $ 3pm relative humidity : int [1:31] 98 98 99 88 72 73 99 99 99 99 ...
## $ 3pm wind direction : Factor w/ 5 levels "NNW","NNE","WNW",..: 1 NA NA 1 1 2 1 3 1 1 ...
## $ 3pm wind speed (km/h) : chr [1:31] "7" "Calm" "Calm" "4" ...
## - attr(*, "spec")=
## .. cols(
## .. col_skip(),
## .. Date = col_character(),
## .. `Min Temp` = col_double(),
## .. `Max Temp` = col_double(),
## .. `Rainfall (mm)` = col_double(),
## .. col_skip(),
## .. col_skip(),
## .. `Direction of maximum wind Gust` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `Speed of maximum wind gust (km/h)` = col_integer(),
## .. `Time of maximum wind gust` = col_character(),
## .. `9am Temperature` = col_double(),
## .. `9am relative humidity` = col_integer(),
## .. col_skip(),
## .. `9am wind direction` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `9am wind speed (km/h)` = col_character(),
## .. col_skip(),
## .. `3pm Temperature` = col_double(),
## .. `3pm relative humidity` = col_integer(),
## .. col_skip(),
## .. `3pm wind direction` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `3pm wind speed (km/h)` = col_character(),
## .. col_skip()
## .. )
# Attributes
attributes(fallsCreekJulyWeather)
## $names
## [1] "Date" "Min Temp"
## [3] "Max Temp" "Rainfall (mm)"
## [5] "Direction of maximum wind Gust" "Speed of maximum wind gust (km/h)"
## [7] "Time of maximum wind gust" "9am Temperature"
## [9] "9am relative humidity" "9am wind direction"
## [11] "9am wind speed (km/h)" "3pm Temperature"
## [13] "3pm relative humidity" "3pm wind direction"
## [15] "3pm wind speed (km/h)"
##
## $class
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31
##
## $spec
## cols(
## col_skip(),
## Date = col_character(),
## `Min Temp` = col_double(),
## `Max Temp` = col_double(),
## `Rainfall (mm)` = col_double(),
## col_skip(),
## col_skip(),
## `Direction of maximum wind Gust` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## `Speed of maximum wind gust (km/h)` = col_integer(),
## `Time of maximum wind gust` = col_character(),
## `9am Temperature` = col_double(),
## `9am relative humidity` = col_integer(),
## col_skip(),
## `9am wind direction` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## `9am wind speed (km/h)` = col_character(),
## col_skip(),
## `3pm Temperature` = col_double(),
## `3pm relative humidity` = col_integer(),
## col_skip(),
## `3pm wind direction` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## `3pm wind speed (km/h)` = col_character(),
## col_skip()
## )
# Convert Date from character to Date type, then check the class
fallsCreekJulyWeather$Date <- as.Date(fallsCreekJulyWeather$Date, "%d/%m/%Y")
class(fallsCreekJulyWeather$Date)
## [1] "Date"
# Convert Time of maximum wind gust from character to a date-time on the associated date (POSIXlt), then check the class
# paste() already separates its arguments with a single space, so no extra " " is needed
fallsCreekJulyWeather$`Time of maximum wind gust` <- paste(fallsCreekJulyWeather$Date, fallsCreekJulyWeather$`Time of maximum wind gust`)
fallsCreekJulyWeather$`Time of maximum wind gust` <- strptime(fallsCreekJulyWeather$`Time of maximum wind gust`, "%Y-%m-%d %H:%M")
class(fallsCreekJulyWeather$`Time of maximum wind gust`)
## [1] "POSIXlt" "POSIXt"
# Order all the wind direction variables <Direction of maximum wind Gust, 9am wind direction, 3pm wind direction> and include all the possible directions: four cardinal directions <North - N, East - E, South - S, West - W>, four intercardinal directions <NE, SE, SW, NW> and eight more divisions <NNE, ENE, ESE, SSE, SSW, WSW, WNW, NNW>.
fallsCreekJulyWeather$`Direction of maximum wind Gust` <- factor(fallsCreekJulyWeather$`Direction of maximum wind Gust`,
levels = c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE", "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"), ordered = TRUE)
fallsCreekJulyWeather$`9am wind direction` <- factor(fallsCreekJulyWeather$`9am wind direction`,
levels = c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE", "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"), ordered = TRUE)
fallsCreekJulyWeather$`3pm wind direction` <- factor(fallsCreekJulyWeather$`3pm wind direction`,
levels = c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE", "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"), ordered = TRUE)
# Check the structure again after executing the above conversion and reorganising the factors
str(fallsCreekJulyWeather)
## tibble [31 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Date : Date[1:31], format: "2019-07-01" "2019-07-02" ...
## $ Min Temp : num [1:31] -3.7 -4.2 -3 -2.3 -0.7 1.3 1.5 1.5 -2.5 -2.1 ...
## $ Max Temp : num [1:31] -0.8 -0.2 1 4.3 6.1 6 3.5 2 -0.6 -0.1 ...
## $ Rainfall (mm) : num [1:31] 1.6 0 0.8 0.2 0 0 0 17.6 3.4 8.6 ...
## $ Direction of maximum wind Gust : Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..: 15 16 7 6 2 16 1 1 16 15 ...
## $ Speed of maximum wind gust (km/h): int [1:31] 24 11 24 19 28 37 61 52 54 43 ...
## $ Time of maximum wind gust : POSIXlt[1:31], format: "2019-07-01 00:03:00" "2019-07-02 01:43:00" ...
## $ 9am Temperature : num [1:31] -3.1 -3 -1.7 1 2.6 3.4 2.2 1.7 -2.1 -0.7 ...
## $ 9am relative humidity : int [1:31] 97 97 98 91 85 75 96 99 97 99 ...
## $ 9am wind direction : Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..: 15 16 NA 14 1 1 16 14 13 16 ...
## $ 9am wind speed (km/h) : chr [1:31] "7" "6" "Calm" "6" ...
## $ 3pm Temperature : num [1:31] -1.6 -0.7 0.1 3.3 5.6 5 2.2 0.9 -0.9 -1 ...
## $ 3pm relative humidity : int [1:31] 98 98 99 88 72 73 99 99 99 99 ...
## $ 3pm wind direction : Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..: 16 NA NA 16 16 2 16 14 16 16 ...
## $ 3pm wind speed (km/h) : chr [1:31] "7" "Calm" "Calm" "4" ...
## - attr(*, "spec")=
## .. cols(
## .. col_skip(),
## .. Date = col_character(),
## .. `Min Temp` = col_double(),
## .. `Max Temp` = col_double(),
## .. `Rainfall (mm)` = col_double(),
## .. col_skip(),
## .. col_skip(),
## .. `Direction of maximum wind Gust` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `Speed of maximum wind gust (km/h)` = col_integer(),
## .. `Time of maximum wind gust` = col_character(),
## .. `9am Temperature` = col_double(),
## .. `9am relative humidity` = col_integer(),
## .. col_skip(),
## .. `9am wind direction` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `9am wind speed (km/h)` = col_character(),
## .. col_skip(),
## .. `3pm Temperature` = col_double(),
## .. `3pm relative humidity` = col_integer(),
## .. col_skip(),
## .. `3pm wind direction` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `3pm wind speed (km/h)` = col_character(),
## .. col_skip()
## .. )
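The wind direction releveling above can be tried on a small vector first. This is only a sketch: the observations in `obs` are made up, and the conversion to a bearing in degrees is my own illustrative addition (it assumes the 16 points are evenly spaced at 22.5 degrees), not something taken from the Bureau of Meteorology file.

```r
# The 16 compass points, clockwise from north (same order as in the report).
compass <- c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
             "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW")

# Made-up observations, including a missing value.
obs  <- c("NW", "NNW", NA, "SE", "N")
wind <- factor(obs, levels = compass, ordered = TRUE)

nlevels(wind)     # 16: unobserved directions are still valid levels
as.integer(wind)  # 15 16 NA 7 1: position of each value in `compass`

# Illustrative only: turn the level position into a bearing in degrees.
bearing <- (as.integer(wind) - 1) * 22.5
bearing           # 315.0 337.5 NA 135.0 0.0
```

Because compass directions wrap around, the ordering only says where a level sits in the clockwise sequence; NNW being "greater than" N has no physical meaning.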
Subset the data frame using the first 10 observations (including all variables), then convert it to a matrix.
I checked the structure of the matrix. Because a matrix is homogeneous, every value has been converted to character.
Character is the most general of the types involved in R's coercion hierarchy (logical < integer < double < character), so all the other data types have been coerced to character in my matrix.
# Subset the data set and convert it to a matrix
subFallsCreekJulyWeather <- fallsCreekJulyWeather[1:10,]
# View the subset: the first 10 rows are listed with all the variables.
head(subFallsCreekJulyWeather, 10)
## # A tibble: 10 x 15
## Date `Min Temp` `Max Temp` `Rainfall (mm)` `Direction of m~
## <date> <dbl> <dbl> <dbl> <ord>
## 1 2019-07-01 -3.7 -0.8 1.6 NW
## 2 2019-07-02 -4.2 -0.2 0 NNW
## 3 2019-07-03 -3 1 0.8 SE
## 4 2019-07-04 -2.3 4.3 0.2 ESE
## 5 2019-07-05 -0.7 6.1 0 NNE
## 6 2019-07-06 1.3 6 0 NNW
## 7 2019-07-07 1.5 3.5 0 N
## 8 2019-07-08 1.5 2 17.6 N
## 9 2019-07-09 -2.5 -0.6 3.4 NNW
## 10 2019-07-10 -2.1 -0.1 8.6 NW
## # ... with 10 more variables: `Speed of maximum wind gust (km/h)` <int>, `Time
## # of maximum wind gust` <dttm>, `9am Temperature` <dbl>, `9am relative
## # humidity` <int>, `9am wind direction` <ord>, `9am wind speed (km/h)` <chr>,
## # `3pm Temperature` <dbl>, `3pm relative humidity` <int>, `3pm wind
## # direction` <ord>, `3pm wind speed (km/h)` <chr>
# Check the structure of the subset; the column types are the same as in the whole data set.
str(subFallsCreekJulyWeather)
## tibble [10 x 15] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:10], format: "2019-07-01" "2019-07-02" ...
## $ Min Temp : num [1:10] -3.7 -4.2 -3 -2.3 -0.7 1.3 1.5 1.5 -2.5 -2.1
## $ Max Temp : num [1:10] -0.8 -0.2 1 4.3 6.1 6 3.5 2 -0.6 -0.1
## $ Rainfall (mm) : num [1:10] 1.6 0 0.8 0.2 0 0 0 17.6 3.4 8.6
## $ Direction of maximum wind Gust : Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..: 15 16 7 6 2 16 1 1 16 15
## $ Speed of maximum wind gust (km/h): int [1:10] 24 11 24 19 28 37 61 52 54 43
## $ Time of maximum wind gust : POSIXlt[1:10], format: "2019-07-01 00:03:00" "2019-07-02 01:43:00" ...
## $ 9am Temperature : num [1:10] -3.1 -3 -1.7 1 2.6 3.4 2.2 1.7 -2.1 -0.7
## $ 9am relative humidity : int [1:10] 97 97 98 91 85 75 96 99 97 99
## $ 9am wind direction : Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..: 15 16 NA 14 1 1 16 14 13 16
## $ 9am wind speed (km/h) : chr [1:10] "7" "6" "Calm" "6" ...
## $ 3pm Temperature : num [1:10] -1.6 -0.7 0.1 3.3 5.6 5 2.2 0.9 -0.9 -1
## $ 3pm relative humidity : int [1:10] 98 98 99 88 72 73 99 99 99 99
## $ 3pm wind direction : Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..: 16 NA NA 16 16 2 16 14 16 16
## $ 3pm wind speed (km/h) : chr [1:10] "7" "Calm" "Calm" "4" ...
# Checking the class of the subset, it's still a data frame
class(subFallsCreekJulyWeather)
## [1] "tbl_df" "tbl" "data.frame"
# convert the subset from data frame to Matrix
subFallsCreekJulyWeather <- as.matrix(subFallsCreekJulyWeather)
# Check the class, now the subset is a matrix
class(subFallsCreekJulyWeather)
## [1] "matrix" "array"
# Check the structure of the matrix, it is character
str(subFallsCreekJulyWeather)
## chr [1:10, 1:15] "2019-07-01" "2019-07-02" "2019-07-03" "2019-07-04" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:15] "Date" "Min Temp" "Max Temp" "Rainfall (mm)" ...
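The coercion seen in the matrix above can be reproduced on a tiny data frame. The sketch below uses a made-up data frame `toy` (its column names are hypothetical, not from the weather file) and also shows the same hierarchy at work in c().

```r
# A made-up data frame mixing Date, numeric and character columns.
toy <- data.frame(day      = as.Date(c("2019-07-01", "2019-07-02")),
                  min_temp = c(-3.7, -4.2),
                  wind     = c("7", "Calm"),
                  stringsAsFactors = FALSE)

# A matrix can hold only one type, so everything becomes character.
m <- as.matrix(toy)
typeof(m)      # "character"
dim(m)         # 2 3

# Keeping only the numeric column avoids the coercion.
m_num <- as.matrix(toy["min_temp"])
typeof(m_num)  # "double"

# The same hierarchy applies when combining values with c().
class(c(TRUE, 1L))   # "integer"
class(c(1L, 2.5))    # "numeric"
class(c(2.5, "a"))   # "character"
```

If the numeric values are needed later, it is usually safer to select the numeric columns before calling as.matrix than to convert the character matrix back.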
Subset the data frame to include only the first and the last variables in the data set, and save it as an R object file (.RData). Below is the relevant R code with explanations and outputs:
# Subset the dataset to include only the first and last variable
subFallsCreekJulyWeather1and15 <- fallsCreekJulyWeather[,c(1,15)]
# View this new subset, there are only 2 variables <columns> and 31 observations <rows>
head(subFallsCreekJulyWeather1and15, 31)
## # A tibble: 31 x 2
## Date `3pm wind speed (km/h)`
## <date> <chr>
## 1 2019-07-01 7
## 2 2019-07-02 Calm
## 3 2019-07-03 Calm
## 4 2019-07-04 4
## 5 2019-07-05 9
## 6 2019-07-06 7
## 7 2019-07-07 19
## 8 2019-07-08 15
## 9 2019-07-09 24
## 10 2019-07-10 26
## # ... with 21 more rows
# Check the structure of this subset, it is still a data frame with the same data type as the whole original data set.
str(subFallsCreekJulyWeather1and15)
## tibble [31 x 2] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:31], format: "2019-07-01" "2019-07-02" ...
## $ 3pm wind speed (km/h): chr [1:31] "7" "Calm" "Calm" "4" ...
is.data.frame(subFallsCreekJulyWeather1and15)
## [1] TRUE
# Save this subset into an R object file
save(subFallsCreekJulyWeather1and15, file="output/R/2_Weather_1and15.Rdata")
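To confirm that a file written by save() really holds the object, it can be loaded back and compared. This is a minimal sketch under my own assumptions: it uses a made-up data frame `toy` and a temporary file rather than the path used in this report, and it loads into a separate environment so nothing in the workspace is overwritten.

```r
# A made-up two-column data frame standing in for the real subset.
toy <- data.frame(Date  = as.Date("2019-07-01") + 0:2,
                  speed = c("7", "Calm", "Calm"),
                  stringsAsFactors = FALSE)

# Write to a temporary file (assumption: any writable path would do).
path <- tempfile(fileext = ".RData")
save(toy, file = path)

# load() restores objects under their original names and returns those names.
env          <- new.env()
loaded_names <- load(path, envir = env)
loaded_names              # "toy"

# The restored copy matches the original, including column types.
identical(env$toy, toy)   # TRUE
```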
Create a data frame with 2 variables and 4 observations. The data frame contains one integer variable, intVar, and one ordinal variable, ordVar.
ordVar has been factored with the proper order. I checked the structure of my variables and the levels of the ordinal variable as shown below. After that, I created a numeric vector, Avg_waiting_time, and used cbind() to add it to my data frame.
I then checked the attributes and the dimensions of my new data frame. Below is the relevant R code with explanations and outputs:
# A new data frame with 2 variables and 4 observations. It contains one integer variable and one ordinal variable.
# Integer Variable with 4 observations
intVar <- c(1L, 2L, 3L, 4L)
# Ordinal Variable with 4 observations. It has been ordered properly
ordVar <- c("slow", "fast", "normal", "express")
ordVar <- factor(ordVar, levels=c("slow","normal", "fast", "express"), ordered=TRUE)
# Print these 2 variables
intVar
## [1] 1 2 3 4
ordVar
## [1] slow fast normal express
## Levels: slow < normal < fast < express
# Combine these 2 variables into a data frame
df_q6 <- data.frame(col1=intVar, col2=ordVar)
# Assign these 2 variables with proper names and print the data frame
colnames(df_q6) <- c("Queue No", "Speed")
df_q6
## Queue No Speed
## 1 1 slow
## 2 2 fast
## 3 3 normal
## 4 4 express
# Check the structure of the data frame
str(df_q6)
## 'data.frame': 4 obs. of 2 variables:
## $ Queue No: int 1 2 3 4
## $ Speed : Ord.factor w/ 4 levels "slow"<"normal"<..: 1 3 2 4
# Create a numeric vector
Avg_waiting_time <- c(8.8, 3.6, 6.3, 2.1)
# Check the class of the numeric vector
class(Avg_waiting_time)
## [1] "numeric"
# Use cbind to add onto the dataframe
df_q6 <- cbind(df_q6, Avg_waiting_time)
# Check the attributes of the new data frame; they show the column and row names. The class of df_q6 is still a data frame.
attributes(df_q6)
## $names
## [1] "Queue No" "Speed" "Avg_waiting_time"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4
# Check the structure of the new dataframe, there are now 3 variables, Queue No is integer, Speed is an ordered factor, Avg_waiting_time is numeric
str(df_q6)
## 'data.frame': 4 obs. of 3 variables:
## $ Queue No : int 1 2 3 4
## $ Speed : Ord.factor w/ 4 levels "slow"<"normal"<..: 1 3 2 4
## $ Avg_waiting_time: num 8.8 3.6 6.3 2.1
# Check the dimension of the new dataframe, it's now 4 observations <rows> with 3 variables <columns>
dim(df_q6)
## [1] 4 3
# Finally print out the new dataframe
df_q6
## Queue No Speed Avg_waiting_time
## 1 1 slow 8.8
## 2 2 fast 3.6
## 3 3 normal 6.3
## 4 4 express 2.1
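One practical payoff of declaring Speed as an ordered factor is that comparisons and sorting follow the level order rather than the alphabet. The sketch below rebuilds the same values under hypothetical names (`speed`, `queue`) so that it runs on its own.

```r
# Same values and level order as ordVar in this report.
speed <- factor(c("slow", "fast", "normal", "express"),
                levels = c("slow", "normal", "fast", "express"),
                ordered = TRUE)

# Comparisons use the level order: slow < normal < fast < express.
speed < "fast"            # TRUE FALSE TRUE FALSE
as.character(max(speed))  # "express"

# Sorting a data frame by the ordered factor (column names are my own).
queue <- data.frame(queue_no = 1:4, speed = speed)
queue[order(queue$speed), "queue_no"]  # 1 3 2 4
```

With an unordered factor, the `<` comparison would return NA with a warning and max() would fail, which is why ordered = TRUE matters here.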