API Reference
create_dashboard_main_page(inference_predictions)
Creates the main dashboard page for the Bavarian Forest National Park visitor information, including visitor counts, parking, weather, recreation, and other information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inference_predictions | DataFrame | The inference predictions for region-wise visitor counts. | required |
Source code in Dashboard.py, lines 43–78.
run_training()
Runs the training pipeline. This includes sourcing and preprocessing the data, training the model, and saving the model.
Source code in Dashboard.py, lines 81–113.
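A minimal sketch of how these two entry points might be wired together, assuming the module layout implied by the source path above; the prediction columns shown are hypothetical:

```python
# Illustrative only: assumes Dashboard.py is importable from the project root
# and that inference predictions arrive as a pandas DataFrame.
import pandas as pd
from Dashboard import create_dashboard_main_page, run_training

# Retrain: sources and preprocesses the data, trains the model, saves it.
run_training()

# Render the visitor dashboard from (hypothetical) region-wise predictions.
inference_predictions = pd.DataFrame(
    {"region": ["Falkenstein", "Lusen"], "predicted_visitors": [120, 95]}
)
create_dashboard_main_page(inference_predictions)
```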
get_latest_parking_data_and_visualize_it()
Display the parking section of the dashboard with a map of real-time parking occupancy and interactive metrics showing actual visitor numbers. The data updates every 15 minutes.
Source code in pages/Admin_🔓.py, lines 35–67.
get_visitor_predictions_section()
Build the visitor predictions section by running/loading the inference pipeline and displaying the predictions as actual visitor numbers.
Source code in pages/Admin_🔓.py, lines 23–32.
add_spatial_info_to_parking_sensors(parking_data_df)
Add spatial information to the parking dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| parking_data_df | DataFrame | DataFrame containing parking sensor data (occupancy, capacity, occupancy rate). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| parking_data_df | DataFrame | DataFrame containing parking sensor data with spatial information. |
Source code in src/streamlit_app/source_data.py, lines 103–120.
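A hedged usage sketch; the column names below are assumptions for illustration, not the module's documented schema:

```python
import pandas as pd
from src.streamlit_app.source_data import add_spatial_info_to_parking_sensors

# Hypothetical sensor frame with occupancy, capacity and occupancy rate.
parking_data_df = pd.DataFrame(
    {
        "location": ["parkplatz-example"],  # placeholder sensor slug
        "current_occupancy": [42],
        "current_capacity": [100],
        "current_occupancy_rate": [0.42],
    }
)
# Returns the same frame enriched with spatial information per sensor.
parking_data_df = add_spatial_info_to_parking_sensors(parking_data_df)
```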
merge_all_df_from_list(df_list)
Merge all the dataframes in the list into a single dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_list | list | A list of pandas DataFrames to merge. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| merged_dataframe | DataFrame | The merged DataFrame. |
Source code in src/streamlit_app/source_data.py, lines 123–135.
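For illustration, a minimal sketch merging two small frames; how the function aligns rows (e.g., on a shared key or the index) is defined in the source, so the shapes here are assumptions:

```python
import pandas as pd
from src.streamlit_app.source_data import merge_all_df_from_list

df_occupancy = pd.DataFrame({"location": ["a", "b"], "occupancy": [10, 20]})
df_capacity = pd.DataFrame({"location": ["a", "b"], "capacity": [50, 80]})

# Collapse the list of frames into a single merged DataFrame.
merged_dataframe = merge_all_df_from_list([df_occupancy, df_capacity])
```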
source_and_preprocess_forecasted_weather_data(timestamp_latest_weather_data_fetch)
Source and preprocess the forecasted weather data for the Bavarian Forest National Park.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| timestamp_latest_weather_data_fetch | datetime | The timestamp of the latest weather data fetch. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| sourced_and_preprocessed_weather_data | DataFrame | Processed forecasted weather dataframe. |
Source code in src/streamlit_app/source_data.py, lines 211–232.
source_and_preprocess_realtime_parking_data(current_timestamp)
Source and preprocess the real-time parking data. Returns the timestamp of when the function was run.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| current_timestamp | datetime | The timestamp of when the function was run. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| processed_parking_data | DataFrame | Preprocessed real-time parking data. |
Source code in src/streamlit_app/source_data.py, lines 138–174.
source_parking_data_from_cloud(location_slug)
Sources the current occupancy data from the Bayern Cloud API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| location_slug | str | The location slug of the parking sensor. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| parking_df_with_spatial_info | DataFrame | A DataFrame containing the current occupancy data, occupancy rate, capacity and spatial coordinates. |
Source code in src/streamlit_app/source_data.py, lines 59–101.
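A sketch of a single-sensor fetch; the slug is a placeholder and valid Bayern Cloud API credentials are assumed to be configured in the environment:

```python
from src.streamlit_app.source_data import source_parking_data_from_cloud

# Placeholder slug; real slugs identify individual parking sensors.
parking_df_with_spatial_info = source_parking_data_from_cloud("parkplatz-example")
print(parking_df_with_spatial_info.head())
```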
source_weather_data(start_time)
Source forecasted weather data from the Meteostat API for the Bavarian Forest National Park in the next 7 days in hourly intervals.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_time | datetime | The start time of the weather data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| weather_hourly | DataFrame | Hourly weather data for the Bavarian Forest National Park for the next 7 days. |
Source code in src/streamlit_app/source_data.py, lines 181–209.
fill_missing_values(data, parameters)
Fill missing values in the weather data using linear interpolation or zero values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Processed hourly weather data. | required |
| parameters | list | List of column names to process. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| data | DataFrame | DataFrame with missing values filled. |
Source code in src/streamlit_app/pre_processing/process_forecast_weather_data.py, lines 10–55.
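A small sketch, assuming temperature-like columns are interpolated while precipitation-like columns fall back to zeros; which strategy applies to which column is decided inside the function, and the column names here are assumptions:

```python
import numpy as np
import pandas as pd
from src.streamlit_app.pre_processing.process_forecast_weather_data import (
    fill_missing_values,
)

data = pd.DataFrame(
    {
        "Temperature (°C)": [18.0, np.nan, 20.0],   # gap to interpolate
        "Precipitation (mm)": [0.0, np.nan, 1.2],   # gap to fill
    }
)
data = fill_missing_values(data, parameters=["Temperature (°C)", "Precipitation (mm)"])
```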
process_weather_data(weather_data_df)
Process the hourly weather data by filling missing values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| weather_data_df | DataFrame | Hourly weather data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| imputed_data | DataFrame | Processed weather data with missing values filled. |
Source code in src/streamlit_app/pre_processing/process_forecast_weather_data.py, lines 58–78.
impute_missing_data(all_parking_data)
Impute missing values in the parking data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| all_parking_data | DataFrame | Raw parking data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| all_parking_data | DataFrame | Processed parking data. |
Source code in src/streamlit_app/pre_processing/process_real_time_parking_data.py, lines 5–37.
process_real_time_parking_data(parking_data_df)
Process the real-time parking data by imputing missing values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| parking_data_df | DataFrame | Raw real-time parking data. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| clean_parking_data | DataFrame | Processed real-time parking data. |
Source code in src/streamlit_app/pre_processing/process_real_time_parking_data.py, lines 39–55.
convert_sensor_dictionary_to_excel_file(sensor_dict, output_file_path)
Convert a sensor dictionary to a pandas DataFrame and save it as an Excel file.
Info: This function is not currently used, but may be useful in the future for handling changes to the sensor configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sensor_dict | dict | A dictionary containing sensor data. | required |
| output_file_path | str | The path to the output Excel file. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/streamlit_app/pre_processing/data_quality_check.py, lines 14–50.
convert_sensor_excel_file_to_dictionary(sensor_file_path)
Convert Excel file containing sensor configuration data to a dictionary.
Info: This function is not currently used, but may be useful in the future for handling changes to the sensor configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sensor_file_path | str | The path to the Excel file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | A dictionary containing sensor configuration. |
Source code in src/streamlit_app/pre_processing/data_quality_check.py, lines 52–86.
int_for_all_counts(df)
Convert all numeric columns in the DataFrame to integer type. Round float values that are not integers, and replace NaN values with 0 to allow conversion to integers.
Source code in src/streamlit_app/pre_processing/data_quality_check.py, lines 128–144.
parse_german_dates(df, date_column_name)
Parses German dates in the specified date column of the DataFrame using regex, including hours and minutes if available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame containing the date column. | required |
| date_column_name | str | The name of the date column. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | The DataFrame with parsed German dates. |
Source code in src/streamlit_app/pre_processing/data_quality_check.py, lines 201–252.
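An illustrative call; the exact German date formats the regex accepts are defined in the source, so the sample strings are assumptions:

```python
import pandas as pd
from src.streamlit_app.pre_processing.data_quality_check import parse_german_dates

# Hypothetical column of German-style timestamps.
df = pd.DataFrame({"Zeit": ["1. Januar 2023 13:00", "2. Februar 2023 08:30"]})
df = parse_german_dates(df, date_column_name="Zeit")
```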
calculate_color(occupancy_rate)
Calculate the color of the marker based on the occupancy rate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| occupancy_rate | float | The occupancy rate of the parking section. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| list |  | A list of RGB values representing the color of the marker. |
Source code in src/streamlit_app/pages_in_dashboard/admin/parking.py, lines 14–33.
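A sketch of the color lookup; the thresholds and exact RGB values live in the function itself, so the comment below is only an expectation:

```python
from src.streamlit_app.pages_in_dashboard.admin.parking import calculate_color

marker_color = calculate_color(0.85)  # a heavily occupied section
# Expected shape: an [R, G, B] list usable as a map-marker color.
```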
get_fixed_size()
Get a fixed size value for the map markers.
Source code in src/streamlit_app/pages_in_dashboard/admin/parking.py, lines 8–12.
get_parking_section(processed_parking_data)
Display the parking section of the dashboard with a map showing the real-time parking occupancy and interactive metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| processed_parking_data | DataFrame | Processed parking data. | required |
Returns:
| Type | Description |
|---|---|
|  | None |
Source code in src/streamlit_app/pages_in_dashboard/admin/parking.py, lines 36–104.
check_password()
Returns True if the user entered the correct password.
Source code in src/streamlit_app/pages_in_dashboard/admin/password.py, lines 4–25.
visitor_prediction_graph(inference_predictions)
Get the visitor counts section with the highest occupancy rate.
Returns:
| Type | Description |
|---|---|
|  | None |
Source code in src/streamlit_app/pages_in_dashboard/admin/visitor_count.py, lines 9–98.
convert_number_to_month_name(month)
Convert a month number (1-12) to its corresponding month name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| month | int | The month number. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str |  | The name of the month. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 46–62.
create_temporal_columns(df)
Create temporal columns from the DataFrame index.
This function takes a DataFrame with a datetime index and creates additional columns for month names, year, and season based on the index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The input DataFrame with a datetime index. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: The original DataFrame with added columns: 'month' (the name of the month corresponding to the index), 'year' (the year extracted from the index), and 'season' (the name of the season corresponding to the index). |
Raises:
| Type | Description |
|---|---|
| ValueError | If the index of the DataFrame cannot be converted to datetime. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 298–335.
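A minimal sketch: a frame with a datetime index gains 'month', 'year', and 'season' columns:

```python
import pandas as pd
from src.streamlit_app.pages_in_dashboard.data_accessibility.data_retrieval import (
    create_temporal_columns,
)

df = pd.DataFrame(
    {"visitors": [100, 150]},
    index=pd.to_datetime(["2023-01-15", "2023-07-15"]),
)
df = create_temporal_columns(df)  # adds 'month', 'year', 'season'
```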
extract_values_according_to_type(selected_query, type)
Extract values from a query string based on the specified query type.
This function uses regular expressions to extract relevant values from the
selected_query string according to the specified type. The extracted values
may include properties, sensors, dates, months, seasons, and years, depending on the type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| selected_query | str | The selected query string from which to extract values. | required |
| type | str | The type of the query. Options include: 'type1' (date range), 'type2' (month and year), 'type3' (season and year), 'type4' (date range, weather category), 'type5' (month and year, weather category), 'type6' (season and year, weather category). | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict |  | A dictionary containing the extracted values, where keys are based on the field names for the given query type. |
Raises:
| Type | Description |
|---|---|
| AttributeError | If the expected regex match is not found in the selected_query. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 68–148.
get_data_from_query(selected_category, selected_query, selected_query_type, start_date, end_date, selected_sensors)
Retrieve data based on the selected category and query.
This function extracts values from the provided query, retrieves data from AWS based on the selected category, processes the data, and returns a DataFrame containing the queried information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| selected_category | str | The category of data to retrieve. Options include: 'visitor_sensors', 'parking', 'weather', 'visitor_centers'. | required |
| selected_query | str | The query string used to extract specific values. | required |
| selected_query_type | str | The type of the query, which determines the format of the expected values. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: A DataFrame containing the filtered data based on the query. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the selected category is not recognized. |
| KeyError | If the expected values are not found in the query. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 487–538.
get_parking_data_for_selected_sensor(selected_sensor)
Fetches parking data for a specified sensor from S3.
This function searches through a list of S3 object paths to find the most relevant object that contains the specified sensor name. It then retrieves the parking data from the corresponding Parquet file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | list | A list of S3 object paths to search for the selected sensor. | required |
| selected_sensor | str | The name of the sensor to filter the objects. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: A DataFrame containing the parking data read from the Parquet file. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the selected sensor is not found in any object. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 410–434.
get_queried_df(processed_category_df, get_values, type, selected_category, start_date, end_date)
Retrieve a filtered DataFrame based on the selected category and query type.
This function filters the input DataFrame processed_category_df according to the
specified selected_category and type. It uses values provided in the get_values
dictionary to perform the filtering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| processed_category_df | DataFrame | The DataFrame containing processed data. | required |
| get_values | dict | A dictionary containing values for filtering, including: 'property' (str): the property to select from the DataFrame; 'start_date' (str): start date for filtering (format: 'YYYY-MM-DD'); 'end_date' (str): end date for filtering (format: 'YYYY-MM-DD'); 'month' (int): month for filtering; 'year' (int): year for filtering; 'season' (str): season for filtering (e.g., 'spring', 'summer'). | required |
| type | str | The type of query to perform. Options include: 'type1' (filter by date range), 'type2' (month and year), 'type3' (season and year), 'type4' (date range, weather category), 'type5' (month and year, weather category), 'type6' (season and year, weather category). | required |
| selected_category | str | The category to filter by. Options include: 'parking', 'weather', 'visitor_centers', 'visitor_sensors'. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: A DataFrame containing the filtered data for the specified property. |
Raises:
| Type | Description |
|---|---|
| KeyError | If 'property' is not in get_values. |
| ValueError | If an invalid type or selected_category is provided. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 151–296.
get_sensors_data()
Fetches sensor data from the most recently modified object.
This function retrieves the sensor data from a specified object in S3 by reading a CSV file. It selects the last object from the provided list of objects, assuming this is the most recently modified.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | list | A list of S3 object paths, where the last object is the most recently modified. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: A DataFrame containing the sensor data read from the CSV file. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 337–355.
get_visitor_centers_data(objects)
Fetches visitor centers data from the most recently modified Excel file.
This function retrieves visitor centers data from a specified Excel file in S3. It selects the last object from the provided list of objects that is an Excel file (with extensions '.xlsx' or '.xls'), assuming this is the most recently modified Excel file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | list | A list of S3 object paths, where the last object ending in '.xlsx' or '.xls' is the most recently modified Excel file. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: A DataFrame containing the visitor centers data read from the Excel file. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 356–384.
get_weather_data(objects)
Fetches weather data from the most recently modified object.
This function retrieves weather data from a specified object in S3 by reading a Parquet file. It selects the last object from the provided list of objects, assuming this is the most recently modified.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| objects | list | A list of S3 object paths, where the last object is the most recently modified. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: A DataFrame containing the weather data read from the Parquet file. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 386–408.
parse_german_dates_regex(df, date_column_name)
Parses German dates in the specified date column of the DataFrame using regex, including hours and minutes if available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame containing the date column. | required |
| date_column_name | str | The name of the date column. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | The DataFrame with parsed German dates. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/data_retrieval.py, lines 437–485.
list_files_in_s3(category)
Lists files in S3 for a given category and returns only file names.
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/download.py, lines 10–14.
load_csv_files_from_aws_s3(path, **kwargs)
Loads individual or multiple CSV files from an AWS S3 bucket.
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/download.py, lines 16–19.
generate_queries(category, start_date, end_date, selected_properties, selected_sensors)
Generate queries based on the selected category and date range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category | str | The category of data (e.g., 'parking', 'weather', 'visitor_sensors', 'visitor_centers'). | required |
| start_date | str | The start date for the queries. | required |
| end_date | str | The end date for the queries. | required |
| selected_properties | list | List of selected properties relevant to the category. | required |
| selected_sensors | list | List of selected sensors relevant to the category. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict |  | A dictionary containing generated queries based on the specified category and filters. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 283–307.
get_queries_for_parking(start_date, end_date, selected_properties, selected_sensors)
Generate queries for parking data based on selected date range, properties, and sensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | The start date for the query. | required |
| end_date | str | The end date for the query. | required |
| selected_properties | list | List of parking properties to include in the query (e.g., occupancy, capacity). | required |
| selected_sensors | list | List of parking sensors to query data for. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict |  | A dictionary with keys "type1", "type2", and "type3" containing queries for the date range. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 197–220.
get_queries_for_visitor_centers(start_date, end_date, selected_sensors)
Generate queries for visitor center data based on the selected date range and sensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | The start date for the query. | required |
| end_date | str | The end date for the query. | required |
| selected_sensors | list | List of visitor center sensors to query data for. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict |  | A dictionary with keys "type4", "type5", and "type6" containing queries for the date range. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 233–255.
get_queries_for_visitor_sensors(start_date, end_date, selected_sensors)
Generate queries for visitor sensor data based on selected date range and sensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | The start date for the query. | required |
| end_date | str | The end date for the query. | required |
| selected_sensors | list | List of visitor sensors to query data for. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict |  | A dictionary with keys "type4", "type5", and "type6" containing queries for the date range. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 258–280.
get_query_section()
Get the query section for data selection and execution.
This function displays a user interface for selecting data categories, date ranges, and additional filters. It allows users to generate specific queries and execute them to retrieve data.
Returns:
| Name | Type | Description |
|---|---|---|
| None |  | This function does not return any values but updates the Streamlit UI with the selected query results and visualizations. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 309–368.
select_category()
Select the category of data to access using st.selectbox from Streamlit.
Returns:
| Name | Type | Description |
|---|---|---|
| category | str | The category selected by the user. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 9–21.
select_date()
Select the start and end date for data access using date inputs in Streamlit.
Returns:
| Name | Type | Description |
|---|---|---|
| tuple |  | The selected start and end date in the format "MM-DD-YYYY". |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 23–56.
select_filters(category, start_date, end_date)
Select additional filters such as sensors, weather values, or parking values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category | str | The category selected by the user. Can be one of: "weather", "parking", "visitor_sensors", "visitor_centers". | required |
Returns:
| Name | Type | Description |
|---|---|---|
| tuple |  | A tuple containing selected_properties and selected_sensors. |
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_box.py, lines 58–195.
get_visualization_section(retrieved_df)
Get the visualization section.
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/query_viz_and_download.py, lines 8–18.
generate_file_name(category, upload_timestamp)
Generates a file name based on the category.
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/upload.py, lines 16–18.
write_csv_file_to_aws_s3(df, path, **kwargs)
Writes a CSV file to AWS S3.
Source code in src/streamlit_app/pages_in_dashboard/data_accessibility/upload.py, lines 12–14.
get_other_information()
Get the other information section.
Source code in src/streamlit_app/pages_in_dashboard/visitors/other_information.py, lines 6–28.
get_page_layout()
Set the page layout for the Streamlit app.
Returns:
| Type | Description |
|---|---|
|  | col1, col2: The two columns of the page layout. |
Source code in src/streamlit_app/pages_in_dashboard/visitors/page_layout_config.py, lines 6–25.
calculate_color_based_on_occupancy_rate(occupancy_rate)
Calculate the color of the marker based on the occupancy rate. Returns a named tuple with the RGB values and a CSS gradient color value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| occupancy_rate | float | The occupancy rate of the parking section. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| list | dict | A list of RGB values representing the color of the marker. |
Source code in src/streamlit_app/pages_in_dashboard/visitors/parking.py, lines 16–39.
get_fixed_size()
Get a fixed size value for the map markers.
Source code in src/streamlit_app/pages_in_dashboard/visitors/parking.py, lines 10–14.
get_occupancy_status(occupancy_rate)
Get the occupancy status (High, Medium, Low) based on the occupancy rate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| occupancy_rate | float | The occupancy rate of the parking section. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str |  | The occupancy status ("High", "Medium", "Low"). |
Source code in src/streamlit_app/pages_in_dashboard/visitors/parking.py, lines 42–57.
get_parking_section()
Display the parking section of the dashboard with a map showing the real-time parking occupancy and interactive metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| processed_parking_data | DataFrame | Processed parking data. | required |
Returns:
| Type | Description |
|---|---|
|  | None |
Source code in src/streamlit_app/pages_in_dashboard/visitors/parking.py, lines 83–183.
render_occupancy_bar(occupancy_rate)
Render a color bar representing the occupancy rate using HTML and CSS.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| occupancy_rate | float | The occupancy rate of the parking section. | required |
Returns:
| Type | Description |
|---|---|
|  | None |
Source code in src/streamlit_app/pages_in_dashboard/visitors/parking.py, lines 59–81.
get_recreation_section()
Get the recreational activities section for the Bavarian Forest National Park.
Returns: None
Source code in src/streamlit_app/pages_in_dashboard/visitors/recreational_activities.py, lines 5–58.
get_visitor_counts_section(inference_predictions)
Get the visitor counts section with the highest occupancy rate.
Returns:
| Type | Description |
|---|---|
|  | None |
Source code in src/streamlit_app/pages_in_dashboard/visitors/visitor_count.py, lines 11–100.
find_peaks(data)
Find peaks in the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Series | The data to find peaks in. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| list |  | A list of indices where peaks occur. |
Source code in src/streamlit_app/pages_in_dashboard/visitors/weather.py, lines 13–27.
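A sketch feeding a short temperature series to the peak finder; per the docstring the result is a list of positional indices:

```python
import pandas as pd
from src.streamlit_app.pages_in_dashboard.visitors.weather import find_peaks

temperatures = pd.Series([12.0, 15.5, 14.0, 18.2, 16.1])
peak_indices = find_peaks(temperatures)  # e.g. indices of local maxima
```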
get_graph(forecast_data)
Display a line graph of the temperature forecast in the same plot, with clear day labels on the x-axis and properly formatted hover info.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| forecast_data | DataFrame | The forecast data to plot. | required |
Returns:
| Type | Description |
|---|---|
|  | plotly.graph_objects.Figure: The plotly figure object. |
Source code in src/streamlit_app/pages_in_dashboard/visitors/weather.py, lines 29–106.
get_weather_section()
Display the weather section of the dashboard.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| processed_weather_data | DataFrame | Processed weather data. | required |
Returns:
| Type | Description |
|---|---|
|  | None |
Source code in src/streamlit_app/pages_in_dashboard/visitors/weather.py, lines 109–148.
get_historical_data_for_location(location_id, location_slug, data_type, api_endpoint_suffix, column_name, save_file_path='outputs')
Fetch historical data from the BayernCloud API and save it as a CSV file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| location_id | str | The ID of the location for which the data is to be fetched. | required |
| location_slug | str | A slug (a URL-friendly string) representing the location. | required |
| data_type | str | The type of data being fetched (e.g., 'occupancy', 'occupancy_rate', 'capacity'). | required |
| api_endpoint_suffix | str | The specific suffix of the API endpoint for the data type (e.g., 'dcls_occupancy', 'dcls_occupancy_rate'). | required |
| column_name | str | The name of the column to store the fetched data in the DataFrame. | required |
| save_file_path | str | The base directory where the CSV file will be saved (default is 'outputs'). | 'outputs' |
Returns:
| Name | Type | Description |
|---|---|---|
| historical_df | DataFrame | A Pandas DataFrame containing the historical data for a location. |
Source code in src/prediction_pipeline/sourcing_data/source_historic_parking_data.py, lines 37–76.
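A hypothetical invocation; the ID and slug are placeholders, and a configured BayernCloud API key is assumed:

```python
from src.prediction_pipeline.sourcing_data.source_historic_parking_data import (
    get_historical_data_for_location,
)

historical_df = get_historical_data_for_location(
    location_id="00000000-0000-0000-0000-000000000000",  # placeholder ID
    location_slug="parkplatz-example",                   # placeholder slug
    data_type="occupancy",
    api_endpoint_suffix="dcls_occupancy",
    column_name="occupancy",
    save_file_path="outputs",
)
```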
process_all_locations(parking_sensors)
Process and fetch all types of historical data for each location in the parking sensors dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| parking_sensors | dict | Dictionary containing location slugs as keys and location IDs as values. | required |
Source code in src/prediction_pipeline/sourcing_data/source_historic_parking_data.py, lines 79–117.
source_historic_visitor_count()
Source historic visitor count data from AWS S3.
Source code in src/prediction_pipeline/sourcing_data/source_historic_visitor_count.py, lines 106–116.
source_data_from_aws_s3(path, **kwargs)
Loads individual or multiple CSV files from an AWS S3 bucket.
Args: path (str): The path to the CSV files on AWS S3. **kwargs: Additional arguments to pass to the read_csv function.
Returns: pd.DataFrame: The DataFrame containing the data from the CSV files.
Source code in src/prediction_pipeline/sourcing_data/source_visitor_center_data.py, lines 7–16.
source_preprocessed_hourly_visitor_center_data()
Load the preprocessed hourly visitor center data from AWS S3.
Source code in src/prediction_pipeline/sourcing_data/source_visitor_center_data.py, lines 23–34.
get_hourly_data(region, start_time, end_time)
Fetch hourly weather data for a specified region and date range.
This function retrieves hourly weather data from the Meteostat API or another defined source, returning it as a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| region | Point | A Point object representing the location for which to fetch data. | required |
| start_date | datetime | The start date for data retrieval. | required |
| end_date | datetime | The end date for data retrieval. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: A DataFrame containing hourly weather data with the following columns: time (datetime of the record), temp (temperature in °C), dwpt (dew point in °C), prcp (precipitation in mm), wdir (wind direction in degrees), wspd (wind speed in km/h), wpgt (wind gust in km/h), pres (sea-level air pressure in hPa), tsun (sunshine duration in minutes), snow (snowfall in mm), rhum (relative humidity in percent), coco (weather condition code). |
Source code in src/prediction_pipeline/sourcing_data/source_weather.py, lines 65–97.
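A sketch using Meteostat's Point; the coordinates are approximate and illustrative, not the pipeline's configured location:

```python
from datetime import datetime
from meteostat import Point
from src.prediction_pipeline.sourcing_data.source_weather import get_hourly_data

bavarian_forest = Point(48.95, 13.40)  # approximate lat/lon of the park area
hourly = get_hourly_data(bavarian_forest, datetime(2023, 7, 1), datetime(2023, 7, 7))
```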
process_hourly_data(data)
Process raw hourly weather data by cleaning and formatting.
This function drops unnecessary columns, renames the remaining columns to more descriptive names, and converts the 'time' column to a datetime format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | A DataFrame containing raw hourly weather data. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: A DataFrame containing the processed hourly weather data with the following columns: Time (datetime of the record), Temperature (°C), Wind Speed (km/h), Relative Humidity (%), and coco_2 (weather condition code). |
Source code in src/prediction_pipeline/sourcing_data/source_weather.py, lines 100–136.
source_weather_data(start_time, end_time)
This function creates a point over the Bavarian Forest National Park, retrieves hourly weather data for the specified time period, processes the data to extract necessary weather parameters, and saves the processed data to a CSV file.
Source code in src/prediction_pipeline/sourcing_data/source_weather.py, lines 138–157.
add_daily_max_values(df, columns)
Add columns to the DataFrame that show the maximum daily value for each weather characteristic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with 'Time' (a datetime column with timestamps) and multiple weather-related columns; columns (list of str) is the list of column names to compute the daily maximum values for. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: DataFrame with new columns that contain the maximum values for each day, repeated for every hour. |
Source code in src/prediction_pipeline/pre_processing/features_zscoreweather_distanceholidays.py, lines 110–138.
add_moving_z_scores(df, columns, window_size)
Add moving z-score columns for weather characteristics based on their daily maximum values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with 'Time' (a datetime column with timestamps) and daily maximum columns. | required |
| columns | list of str | List of column names to compute the moving z-scores for. | required |
| window_size | int | Size of the moving window in days. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: DataFrame with new columns that contain the moving z-scores for each column. |
Source code in src/prediction_pipeline/pre_processing/features_zscoreweather_distanceholidays.py, lines 140–185.
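A sketch over a tiny hourly frame; the daily-maximum column name follows the convention described above and is an assumption:

```python
import pandas as pd
from src.prediction_pipeline.pre_processing.features_zscoreweather_distanceholidays import (
    add_moving_z_scores,
)

df = pd.DataFrame(
    {
        "Time": pd.date_range("2023-07-01", periods=48, freq="h"),
        "Temperature (°C)_daily_max": 20.0,  # assumed daily-max column name
    }
)
df = add_moving_z_scores(df, columns=["Temperature (°C)_daily_max"], window_size=7)
```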
add_nearest_holiday_distance(df)
Add columns to the DataFrame calculating the distance to the nearest holiday for both 'Feiertag_Bayern' and 'Feiertag_CZ'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with 'Time' (a datetime column with timestamps), 'Feiertag_Bayern' (boolean column indicating if the date is a holiday in Bayern), and 'Feiertag_CZ' (boolean column indicating if the date is a holiday in CZ). | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: DataFrame with two new columns: 'Distance_to_Nearest_Holiday_Bayern' (distance in days to the nearest holiday in Bayern for each day/row) and 'Distance_to_Nearest_Holiday_CZ' (distance in days to the nearest holiday in CZ for each day/row). |
Source code in src/prediction_pipeline/pre_processing/features_zscoreweather_distanceholidays.py, lines 51–108.
load_csv_files_from_aws_s3(path, **kwargs)
Loads individual or multiple CSV files from an AWS S3 bucket.
Args: path (str): The path to the CSV files on AWS S3. **kwargs: Additional arguments to pass to the read_csv function.
Returns: pd.DataFrame: The DataFrame containing the data from the CSV files.
Source code in src/prediction_pipeline/pre_processing/features_zscoreweather_distanceholidays.py, lines 19–28.
slice_at_first_non_null(df)
Slices the DataFrame starting at the first non-null value in the 'Feiertag_Bayern' column.
We don't have data for holidays in 2016, so the function finds the index of the first non-null value in the 'Feiertag_Bayern' column and returns the DataFrame sliced from that index onward.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing the 'Feiertag_Bayern' column. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: The sliced DataFrame starting from the first non-null value in 'Feiertag_Bayern'. |
Source code in src/prediction_pipeline/pre_processing/features_zscoreweather_distanceholidays.py, lines 30–49.
write_csv_file_to_aws_s3(df, path, **kwargs)
Writes an individual CSV file to AWS S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to write. | required |
| path | str | The path to the CSV files on AWS S3. | required |
| **kwargs |  | Additional arguments to pass to the to_csv function. | {} |
Source code in src/prediction_pipeline/pre_processing/features_zscoreweather_distanceholidays.py, lines 187–197.
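A sketch only: the bucket and key are placeholders, and AWS credentials are assumed to be available to the underlying writer:

```python
import pandas as pd
from src.prediction_pipeline.pre_processing.features_zscoreweather_distanceholidays import (
    write_csv_file_to_aws_s3,
)

df = pd.DataFrame({"Time": ["2023-07-01 00:00"], "traffic_abs": [42]})
write_csv_file_to_aws_s3(df, "s3://example-bucket/preprocessed/features.csv", index=False)
```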
check_data_quality(data, sensor)
Check data quality: flag rows where the occupancy is greater than the capacity of the parking space.
Source code in src/prediction_pipeline/pre_processing/impute_missing_parking_data.py, lines 64–83.
check_data_quality_occupancy_rate(data, sensor)
Check data quality: flag rows where the occupancy rate is greater than 100.
Source code in src/prediction_pipeline/pre_processing/impute_missing_parking_data.py, lines 98–114.
check_missing_data_per_sensor(data, sensor)
Check missing data per sensor.
Source code in src/prediction_pipeline/pre_processing/impute_missing_parking_data.py, lines 34–49.
fill_missing_values(data)
Fill missing values in the data.
Source code in src/prediction_pipeline/pre_processing/impute_missing_parking_data.py, lines 24–31.
impute_occupancy_values(data)
Impute occupancy values where the occupancy is greater than the capacity.
Source code in src/prediction_pipeline/pre_processing/impute_missing_parking_data.py, lines 51–61.
main()
Main function to run the script.
Source code in src/prediction_pipeline/pre_processing/impute_missing_parking_data.py, lines 116–130.
save_higher_occupancy_rate(data, sensor)
Save the rows where the occupancy rate is greater than 100.
Source code in src/prediction_pipeline/pre_processing/impute_missing_parking_data.py, lines 85–96.
create_datetimeindex(df)
Prepare DataFrame by ensuring the index is a DateTimeIndex, resampling to hourly frequency, and handling missing values.
Parameters:
- df: DataFrame containing the data.
- "Time": Name of the timestamp column to convert and set as the index.
Returns:
- df: DataFrame resampled to hourly frequency with missing values handled.
Source code in src/prediction_pipeline/pre_processing/join_sensor_weather_visitorcenter.py, lines 9–30.
get_joined_dataframe(weather_data, visitor_count_data, visitorcenter_data)
Main function to run the data joining pipeline.
This function loads the visitor count, visitor center and weather data, preprocesses them and joins them into one dataframe.
Returns:
| Type | Description |
|---|---|
| DataFrame | The joined data. |
Source code in src/prediction_pipeline/pre_processing/join_sensor_weather_visitorcenter.py, lines 45–60.
join_dataframes(df_list)
Joins a list of DataFrames using an outer join along the columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df_list | list of pd.DataFrame | A list of pandas DataFrames to join. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | A single DataFrame resulting from concatenating all input DataFrames along columns. |
Source code in src/prediction_pipeline/pre_processing/join_sensor_weather_visitorcenter.py, lines 32–42.
Clean historic sensor data from 2016 to 2024. In the docstring of every function you can check what it does and the assumptions that were made.
Usage:
- Change the global variables section if needed
- Fill in your AWS credentials
Output:
- Returns the preprocessed data
calculate_traffic_metrics_abs(df)
This function calculates several traffic metrics and adds them to the DataFrame:
- traffic_abs: The sum of all INs and OUTs for every sensor
- sum_IN_abs: The sum of all columns containing 'IN' in their names.
- sum_OUT_abs: The sum of all columns containing 'OUT' in their names.
- diff_abs: The difference between sum_IN_abs and sum_OUT_abs.
- occupancy_abs: The cumulative sum of diff_abs, representing the occupancy over time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame containing traffic data. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: The DataFrame with additional columns for absolute traffic metrics. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 405–429.
correct_and_impute_times(df)
Corrects repeated timestamps caused by a 2-hour interval that is indicative of a daylight saving time change.
The function operates under the following assumptions:
1. By default, every interval should be 1 hour.
2. If any interval differs from this, the repeated timestamp is corrected by subtracting one hour.
3. The data values for the corrected timestamp are then imputed from the next available row.
4. 2017 is an odd year where the null row is not the one with the 2-hour interval, but the one with 0. We fixed this manually for these specific rows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | A DataFrame containing a 'Time' column with datetime-like values and other associated data columns. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: The corrected DataFrame with timestamps set as the index and sorted chronologically. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the 'Time' column is missing from the DataFrame. |
| KeyError | If an index out of range occurs due to imputation attempts beyond the DataFrame bounds. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 169–208.
correct_non_replaced_sensors(df)
Replaces data with NaN for non-replaced sensors in the DataFrame based on specified timestamps. A dictionary is provided where keys are timestamps as strings and values are lists of column names that should be set to NaN if the index is earlier than the timestamp.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to be corrected. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: The DataFrame with corrected sensor data. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 210–236.
correct_overlapping_sensor_data(df)
Corrects overlapping sensor data by setting specific values to NaN based on replacement dates. Also filters the DataFrame to include only rows with an index timestamp on or after "2016-05-10 03:00:00", which is 3 a.m. after the installation date of the first working sensor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame containing sensor data to be corrected. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: The DataFrame with corrected sensor data. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 241–315.
fix_columns_names(df)
Processes the given DataFrame by renaming columns, dropping specified columns, and creating a new Bucina_Multi IN column by summing the Bucina_Multi Fahrräder IN and Bucina_Multi Fußgänger IN columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to be modified. | required |
| rename | dict | A dictionary where the keys are existing column names and the values are the new column names. | required |
| drop | list | A list of column names that should be removed from the DataFrame. | required |
| create | str | The name of the new column that will be created by summing the "Bucina_Multi Fahrräder IN" and "Bucina_Multi Fußgänger IN" columns. | required |
Returns:
| Type | Description |
|---|---|
|  | pd.DataFrame: The modified DataFrame with the specified changes applied. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 92–164.
handle_outliers(df)
Transform to NaN every value higher than 800. During exploration we found that values above 800 are outliers; only 6 rows had any count over 800.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with values to be turned to NaN. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: The modified DataFrame with values over 800 turned to NaN. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 354–367.
merge_columns(df)
Merges columns from replaced sensors in the DataFrame into new combined columns based on a predefined mapping and drops the original columns after merging. Additionally, drops columns with names containing "Fahrräder" or "Fußgänger" as we will not use that distinction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | A DataFrame containing columns to be merged. | required |
Returns:
| Type | Description |
|---|---|
|  | pandas.DataFrame: The modified DataFrame with the new merged columns, original columns removed, and Fahrräder or Fußgänger columns dropped. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 369–403.
parse_german_dates(df, date_column_name)
Parses German dates in the specified date column of the DataFrame using regex, including hours and minutes if available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame containing the date column. | required |
| date_column_name | str | The name of the date column. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | The DataFrame with parsed German dates. |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 41–89.
write_csv_file_to_aws_s3(df, path, **kwargs)
Writes an individual CSV file to AWS S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to write. | required |
| path | str | The path to the CSV files on AWS S3. | required |
| **kwargs |  | Additional arguments to pass to the to_csv function. | {} |
Source code in src/prediction_pipeline/pre_processing/preprocess_historic_visitor_count_data.py, lines 431–441.
add_and_translate_day_of_week(df)
Create a new column 'Wochentag' that represents the day of the week in German.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Datum' column with date information.
Returns: pandas.DataFrame: DataFrame with updated 'Wochentag' column in German.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 215–249.
add_date_variables(df)
Create new columns for day, month, and year from a date column in the DataFrame.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Datum' column with date information.
Returns: pandas.DataFrame: DataFrame with additional columns for day, month, and year.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 166–189
add_season_variable(df)
Create a new column 'Jahreszeit' in the DataFrame based on the month variable.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Monat' column with month information.
Returns: pandas.DataFrame: DataFrame with an additional 'Jahreszeit' column representing the season.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 191–213
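A common choice is meteorological seasons keyed on the month number. A minimal sketch under that assumption (the source's season boundaries may differ):

```python
import pandas as pd

SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Frühling", 4: "Frühling", 5: "Frühling",
           6: "Sommer", 7: "Sommer", 8: "Sommer",
           9: "Herbst", 10: "Herbst", 11: "Herbst"}

def add_season_variable(df: pd.DataFrame) -> pd.DataFrame:
    df["Jahreszeit"] = df["Monat"].map(SEASONS)
    return df
```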
add_weekend_variable(df)
Create a new binary column 'Wochenende' indicating whether the day is a weekend.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Wochentag' column with German day names.
Returns: pandas.DataFrame: DataFrame with an additional 'Wochenende' column indicating weekend status.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 251–267
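With German day names already present, the weekend flag is a set-membership test. A minimal sketch:

```python
import pandas as pd

def add_weekend_variable(df: pd.DataFrame) -> pd.DataFrame:
    # 1 for Saturday/Sunday, 0 otherwise
    df["Wochenende"] = df["Wochentag"].isin(["Samstag", "Sonntag"]).astype(int)
    return df
```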
correct_and_convert_schulferien(df_visitcenters)
Corrects a typo in the 'Schulferien_Bayern' column and converts it to boolean type.
Parameters: df_visitcenters (pandas.DataFrame): DataFrame containing the 'Schulferien_Bayern' column.
Returns: pandas.DataFrame: DataFrame with corrected 'Schulferien_Bayern' values and converted to boolean type.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 49–65
correct_and_convert_wgm_geoeffnet(df)
Corrects the 'WGM_geoeffnet' column by replacing the value 11 with 1. Converts the column to boolean type.
Parameters: df (pandas.DataFrame): DataFrame containing the 'WGM_geoeffnet' column.
Returns: pandas.DataFrame: DataFrame with 'WGM_geoeffnet' corrected and converted to boolean type.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 102–119
correct_besuchszahlen_heh(df)
Corrects the 'Besuchszahlen_HEH' column by rounding up values with non-zero fractional parts to the nearest whole number. Converts the column to Int64 type to retain NaN values.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Besuchszahlen_HEH' column.
Returns: pandas.DataFrame: DataFrame with 'Besuchszahlen_HEH' corrected and converted to Int64 type.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 81–100
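Rounding up while keeping NaN requires pandas' nullable Int64 dtype, since plain int64 cannot represent missing values. A minimal sketch:

```python
import numpy as np
import pandas as pd

def correct_besuchszahlen_heh(df: pd.DataFrame) -> pd.DataFrame:
    # np.ceil leaves NaN untouched; astype("Int64") turns NaN into <NA>
    df["Besuchszahlen_HEH"] = np.ceil(df["Besuchszahlen_HEH"]).astype("Int64")
    return df
```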
create_hourly_dataframe(df)
Expands the daily data in the DataFrame to an hourly level by duplicating each day into 24 hourly rows.
Parameters: df (pandas.DataFrame): DataFrame containing daily data with a 'Datum' column representing dates.
Returns: pandas.DataFrame: New DataFrame with an hourly level where each day is expanded into 24 hourly rows.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 382–402
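One idiomatic expansion is a cross join of each daily row with the 24 hours. A minimal sketch, assuming 'Datum' is a datetime column (the hour column name is invented):

```python
import pandas as pd

def create_hourly_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    hours = pd.DataFrame({"Stunde": range(24)})  # hypothetical hour column
    hourly = df.merge(hours, how="cross")        # 24 copies of every daily row
    hourly["Datum"] = hourly["Datum"] + pd.to_timedelta(hourly["Stunde"], unit="h")
    return hourly
```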
detect_outliers_std(df, column, num_sd=7)
Detect outliers in a specific column of the DataFrame using the standard deviation method.
Parameters:
df (pandas.DataFrame): DataFrame containing the column to check.
column (str): Name of the column to check for outliers.
num_sd (int): Number of standard deviations to define the outlier bounds (default is 7).
Returns: pandas.DataFrame: DataFrame containing rows with outliers in the specified column.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 315–336
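The standard deviation method flags everything outside mean ± num_sd·std. A minimal sketch:

```python
import pandas as pd

def detect_outliers_std(df: pd.DataFrame, column: str, num_sd: int = 7) -> pd.DataFrame:
    mean, std = df[column].mean(), df[column].std()
    lower, upper = mean - num_sd * std, mean + num_sd * std
    return df[(df[column] < lower) | (df[column] > upper)]
```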
handle_outliers(df, num_sd=7)
Detect and handle outliers for a list of columns by replacing them with NaN.
Parameters:
df (pandas.DataFrame): DataFrame containing the columns to check.
num_sd (int): Number of standard deviations to define the outlier bounds (default is 7).
Returns: pandas.DataFrame: DataFrame with outliers replaced by NaN in the specified columns.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 338–372
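Detection and replacement compose naturally: for each configured column, values beyond the bound are set to NaN. A minimal sketch; the column list is invented for illustration:

```python
import numpy as np
import pandas as pd

COLUMNS_TO_CHECK = ["Besuchszahlen_HEH"]  # hypothetical; the source defines its own list

def handle_outliers(df: pd.DataFrame, num_sd: int = 7) -> pd.DataFrame:
    for column in COLUMNS_TO_CHECK:
        mean, std = df[column].mean(), df[column].std()
        is_outlier = (df[column] - mean).abs() > num_sd * std
        df.loc[is_outlier, column] = np.nan
    return df
```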
remove_last_row_if_needed(df)
Removes the last row from the DataFrame if it has 2923 rows.
Parameters: df (pandas.DataFrame): DataFrame to be checked and modified.
Returns: pandas.DataFrame: Updated DataFrame with the last row removed if the initial length was 2923.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 121–136
rename_and_set_time_as_index(df)
Rename columns, convert 'time' column to datetime, and set 'time' as the index.
Parameters: df (pandas.DataFrame): DataFrame containing data with a 'Datum' column to be renamed and converted.
Returns: pandas.DataFrame: The cleaned DataFrame with 'Datum' renamed to 'time', converted to datetime, and 'time' set as index.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 404–424
reorder_columns(df)
Reorder columns in the DataFrame to place date-related variables together.
Parameters: df (pandas.DataFrame): DataFrame with various columns including date-related variables.
Returns: pandas.DataFrame: DataFrame with columns reordered to place date-related variables next to each other.
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 269–294
write_parquet_file_to_aws_s3(df, path, **kwargs)
Writes an individual Parquet file to AWS S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to write. | required |
| path | str | The path to the Parquet file on AWS S3. | required |
| **kwargs | | Additional arguments to pass to the to_parquet function. | {} |
Source code in src/prediction_pipeline/pre_processing/preprocess_visitor_center_data.py, lines 426–439
fill_missing_values(data, parameters)
Fill missing values in the weather data using linear interpolation or zero values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Processed hourly weather data. | required |
| parameters | list | List of column names to process. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with missing values filled. |
Source code in src/prediction_pipeline/pre_processing/preprocess_weather_data.py, lines 21–70
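Linear interpolation suits continuous parameters such as temperature, while zero-filling suits event-like ones such as precipitation; which parameter falls in which bucket below is an assumption. A minimal sketch:

```python
import pandas as pd

ZERO_FILL = {"Niederschlag"}  # hypothetical: gaps mean "no rain recorded"

def fill_missing_values(data: pd.DataFrame, parameters: list) -> pd.DataFrame:
    for col in parameters:
        if col in ZERO_FILL:
            data[col] = data[col].fillna(0)
        else:
            data[col] = data[col].interpolate(method="linear")
    return data
```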
process_weather_data(sourced_df)
This function creates a point over the Bavarian Forest National Park, retrieves hourly weather data for the specified time period, processes the data to extract necessary weather parameters, and saves the processed data to a CSV file.
Source code in src/prediction_pipeline/pre_processing/preprocess_weather_data.py, lines 74–92
add_and_translate_day_of_week(df)
Create a new column 'Wochentag' that represents the day of the week in German.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Datum' column with date information.
Returns: pandas.DataFrame: DataFrame with updated 'Wochentag' column in German.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 233–267
add_date_variables(df)
Create new columns for day, month, and year from a date column in the DataFrame.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Datum' column with date information.
Returns: pandas.DataFrame: DataFrame with additional columns for day, month, and year.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 184–207
add_season_variable(df)
Create a new column 'Jahreszeit' in the DataFrame based on the month variable.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Monat' column with month information.
Returns: pandas.DataFrame: DataFrame with an additional 'Jahreszeit' column representing the season.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 209–231
add_weekend_variable(df)
Create a new binary column 'Wochenende' indicating whether the day is a weekend.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Wochentag' column with German day names.
Returns: pandas.DataFrame: DataFrame with an additional 'Wochenende' column indicating weekend status.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 269–285
correct_and_convert_schulferien(df_visitcenters)
Corrects a typo in the 'Schulferien_Bayern' column and converts it to boolean type.
Parameters: df_visitcenters (pandas.DataFrame): DataFrame containing the 'Schulferien_Bayern' column.
Returns: pandas.DataFrame: DataFrame with corrected 'Schulferien_Bayern' values and converted to boolean type.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 67–83
correct_and_convert_wgm_geoeffnet(df)
Corrects the 'WGM_geoeffnet' column by replacing the value 11 with 1. Converts the column to boolean type.
Parameters: df (pandas.DataFrame): DataFrame containing the 'WGM_geoeffnet' column.
Returns: pandas.DataFrame: DataFrame with 'WGM_geoeffnet' corrected and converted to boolean type.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 120–137
correct_besuchszahlen_heh(df)
Corrects the 'Besuchszahlen_HEH' column by rounding up values with non-zero fractional parts to the nearest whole number. Converts the column to Int64 type to retain NaN values.
Parameters: df (pandas.DataFrame): DataFrame containing the 'Besuchszahlen_HEH' column.
Returns: pandas.DataFrame: DataFrame with 'Besuchszahlen_HEH' corrected and converted to Int64 type.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 99–118
create_hourly_dataframe(df)
Expands the daily data in the DataFrame to an hourly level by duplicating each day into 24 hourly rows.
Parameters: df (pandas.DataFrame): DataFrame containing daily data with a 'Datum' column representing dates.
Returns: pandas.DataFrame: New DataFrame with an hourly level where each day is expanded into 24 hourly rows.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 400–420
detect_outliers_std(df, column, num_sd=7)
Detect outliers in a specific column of the DataFrame using the standard deviation method.
Parameters:
df (pandas.DataFrame): DataFrame containing the column to check.
column (str): Name of the column to check for outliers.
num_sd (int): Number of standard deviations to define the outlier bounds (default is 7).
Returns: pandas.DataFrame: DataFrame containing rows with outliers in the specified column.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 333–354
handle_outliers(df, num_sd=7)
Detect and handle outliers for a list of columns by replacing them with NaN.
Parameters:
df (pandas.DataFrame): DataFrame containing the columns to check.
num_sd (int): Number of standard deviations to define the outlier bounds (default is 7).
Returns: pandas.DataFrame: DataFrame with outliers replaced by NaN in the specified columns.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 356–390
remove_last_row_if_needed(df)
Removes the last row from the DataFrame if it has 2923 rows.
Parameters: df (pandas.DataFrame): DataFrame to be checked and modified.
Returns: pandas.DataFrame: Updated DataFrame with the last row removed if the initial length was 2923.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 139–154
rename_and_set_time_as_index(df)
Rename columns, convert 'time' column to datetime, and set 'time' as the index.
Parameters: df (pandas.DataFrame): DataFrame containing data with a 'Datum' column to be renamed and converted.
Returns: pandas.DataFrame: The cleaned DataFrame with 'Datum' renamed to 'time', converted to datetime, and 'time' set as index.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 422–442
reorder_columns(df)
Reorder columns in the DataFrame to place date-related variables together.
Parameters: df (pandas.DataFrame): DataFrame with various columns including date-related variables.
Returns: pandas.DataFrame: DataFrame with columns reordered to place date-related variables next to each other.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 287–312
source_data_from_aws_s3(path, **kwargs)
Loads individual or multiple CSV files from an AWS S3 bucket.
Parameters:
path (str): The path to the CSV files on AWS S3.
**kwargs: Additional arguments to pass to the read_csv function.
Returns: pd.DataFrame: The DataFrame containing the data from the CSV files.
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 27–36
write_parquet_file_to_aws_s3(df, path, **kwargs)
Writes an individual Parquet file to AWS S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to write. | required |
| path | str | The path to the Parquet file on AWS S3. | required |
| **kwargs | | Additional arguments to pass to the to_parquet function. | {} |
Source code in src/prediction_pipeline/pre_processing/visitor_center_processing_script.py, lines 444–457
load_latest_models(bucket_name, folder_prefix, models_names)
Load the latest model files from an S3 folder based on the model names and return them in a dictionary whose keys carry a 'loaded_' prefix.
Parameters:
bucket_name (str): The name of the S3 bucket.
folder_prefix (str): The folder path within the bucket.
models_names (list): List of model names without the 'extra_trees_' prefix.
Returns: dict: A dictionary containing the loaded models with keys prefixed by 'loaded_'.
Source code in src/prediction_pipeline/modeling/create_inference_dfs.py, lines 28–64
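A typical implementation lists the folder, picks the newest object per model name, and unpickles it. A minimal sketch using boto3 and joblib; the bucket layout and file-naming scheme are assumptions:

```python
import io
import boto3
import joblib

def load_latest_models(bucket_name: str, folder_prefix: str, models_names: list) -> dict:
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_prefix)["Contents"]
    loaded = {}
    for name in models_names:
        # Newest object whose key mentions this model name (naming scheme assumed)
        candidates = [o for o in objects if name in o["Key"]]
        latest = max(candidates, key=lambda o: o["LastModified"])
        body = s3.get_object(Bucket=bucket_name, Key=latest["Key"])["Body"].read()
        loaded[f"loaded_{name}"] = joblib.load(io.BytesIO(body))
    return loaded
```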
predict_with_models(loaded_models, df_features)
Given a dictionary of models and a DataFrame of features, this function predicts the target values using each model and saves the inference predictions to AWS S3 (to be further loaded from Streamlit).
Parameters:
loaded_models (dict): A dictionary of models where keys are model names and values are the trained models.
df_features (pd.DataFrame): A DataFrame containing the features to make predictions on.
Returns: pd.DataFrame: A DataFrame containing the predictions of all models per region.
Source code in src/prediction_pipeline/modeling/create_inference_dfs.py, lines 68–111
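Per region, inference is one predict call per loaded model, collected into a single frame. A minimal sketch; the key-to-region convention is assumed, and the S3 upload step is only indicated:

```python
import pandas as pd

def predict_with_models(loaded_models: dict, df_features: pd.DataFrame) -> pd.DataFrame:
    predictions = pd.DataFrame(index=df_features.index)
    for model_name, model in loaded_models.items():
        region = model_name.removeprefix("loaded_")  # e.g. "loaded_Falkenstein" -> "Falkenstein"
        predictions[region] = model.predict(df_features)
    # The real function also writes the predictions to AWS S3 for the Streamlit app.
    return predictions
```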
join_inference_data(weather_data_inference, visitor_centers_data)
Merge weather data with visitor centers data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| weather_data_inference | DataFrame | DataFrame containing weather data. | required |
| visitor_centers_data | DataFrame | DataFrame containing visitor centers data. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Merged DataFrame with selected columns from visitor centers data. |
Source code in src/prediction_pipeline/modeling/preprocess_inference_features.py, lines 13–33
source_preprocess_inference_data(weather_data_inference, hourly_visitor_center_data, start_time, end_time)
Source and preprocess inference data from weather and visitor center sources.
This function fetches weather and visitor center data, merges them, and computes additional features such as nearest holiday distance, daily max values, and moving z-scores.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame containing preprocessed inference data. |
Source code in src/prediction_pipeline/modeling/preprocess_inference_features.py, lines 35–79
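Of the derived features, the moving z-score is the least obvious: each value is standardized against a rolling window of its own history. A minimal sketch of that single feature; the window length is an assumption:

```python
import pandas as pd

def moving_zscore(series: pd.Series, window: int = 24) -> pd.Series:
    rolling = series.rolling(window, min_periods=2)
    return (series - rolling.mean()) / rolling.std()  # NaN until enough history exists
```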
run_inference(preprocessed_hourly_visitor_center_data)
Run the inference pipeline. Fetches the latest weather forecasts, preprocesses data, and makes predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preprocessed_hourly_visitor_center_data | DataFrame | The preprocessed hourly visitor center data. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/prediction_pipeline/modeling/run_inference.py, lines 12–54
change_datatypes(df, dtype_dict)
Change column datatypes based on dtype_dict.
Source code in src/prediction_pipeline/modeling/source_and_feature_selection.py, lines 269–276
filter_features_for_modelling(df)
Filter the features for modelling.
Source code in src/prediction_pipeline/modeling/source_and_feature_selection.py, lines 391–396
get_regionwise_IN_and_OUT_columns(df)
Preprocess the data by summing IN and OUT columns for each region.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame containing the data to preprocess. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | The preprocessed DataFrame. |
Source code in src/prediction_pipeline/modeling/source_and_feature_selection.py, lines 198–218
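The summation can be driven by the column naming convention: every sensor column for a region ending in IN contributes to that region's IN total, and likewise for OUT. A minimal sketch; the region names and naming convention are assumptions:

```python
import pandas as pd

REGIONS = ["Falkenstein", "Lusen"]  # hypothetical region names

def get_regionwise_IN_and_OUT_columns(df: pd.DataFrame) -> pd.DataFrame:
    sums = pd.DataFrame(index=df.index)
    for region in REGIONS:
        for direction in ("IN", "OUT"):
            cols = [c for c in df.columns
                    if c.startswith(region) and c.endswith(direction)]
            sums[f"{region} {direction}"] = df[cols].sum(axis=1)
    return sums
```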
load_csv_files_from_aws_s3(path, **kwargs)
Loads individual or multiple CSV files from an AWS S3 bucket.
Parameters:
path (str): The path to the CSV files on AWS S3.
**kwargs: Additional arguments to pass to the read_csv function.
Returns: pd.DataFrame: The DataFrame containing the data from the CSV files.
Source code in src/prediction_pipeline/modeling/source_and_feature_selection.py, lines 222–233
merge_new_features(df, df_newfeatures)
Merges new features from df_newfeatures into df on 'Time' column.
Source code in src/prediction_pipeline/modeling/source_and_feature_selection.py, lines 242–266
process_transformations(df)
Process the transformations on the DataFrame.
Source code in src/prediction_pipeline/modeling/source_and_feature_selection.py, lines 381–389
build_lstm_model(input_shape, output_size)
Build and compile the LSTM model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_shape | tuple | Shape of the input data (time_steps, features). | required |
| output_size | int | Number of target variables. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| model | Model | Compiled LSTM model. |
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 70–89
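A minimal Keras sketch matching this signature; the layer sizes, loss, and optimizer are assumptions rather than the project's actual configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_model(input_shape: tuple, output_size: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=input_shape),  # (time_steps, features)
        layers.LSTM(64),                  # hidden size assumed
        layers.Dense(output_size),        # one output per target variable
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```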
create_sequences(X, y, window_size)
Create sequences from the data for LSTM input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature array. | required |
| y | ndarray | Target variable array. | required |
| window_size | int | The number of time steps to include in each sequence. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X_seq | ndarray | Sequence features. |
| y_seq | ndarray | Sequence targets. |
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 51–68
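Windowing turns a (samples, features) array into overlapping (window_size, features) blocks, with each target taken at the step right after its block. A minimal sketch:

```python
import numpy as np

def create_sequences(X: np.ndarray, y: np.ndarray, window_size: int):
    X_seq, y_seq = [], []
    for i in range(len(X) - window_size):
        X_seq.append(X[i:i + window_size])  # window of past steps
        y_seq.append(y[i + window_size])    # target right after the window
    return np.array(X_seq), np.array(y_seq)
```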
load_data()
Load preprocessed data from the feature_selection_and_preprocessing module.
Returns: df (pd.DataFrame): The preprocessed dataframe with features and target variables.
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 9–16
save_model(model, model_path='lstm_model.h5')
Save the trained model to a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Model | The trained LSTM model. | required |
| model_path | str | Path to save the model file. | 'lstm_model.h5' |
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 109–117
split_features_targets(df, target_vars)
Split the dataframe into features and target variables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The dataframe with features and target variables. | required |
| target_vars | list | List of target variable column names. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| X | ndarray | Feature array. |
| y | ndarray | Target variable array. |
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 18–32
split_train_test(X, y, test_size=0.1)
Split data into training and test sets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Feature array. | required |
| y | ndarray | Target variable array. | required |
| test_size | float | Proportion of the dataset to include in the test split. | 0.1 |
Returns:
| Name | Type | Description |
|---|---|---|
| X_train | ndarray | Training features. |
| X_eval | ndarray | Evaluation features. |
| y_train | ndarray | Training targets. |
| y_eval | ndarray | Evaluation targets. |
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 34–49
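For time-series data the split should be chronological rather than shuffled, so a plausible implementation simply cuts off the tail; whether the source shuffles is not documented here, so treat this as an assumption:

```python
import numpy as np

def split_train_test(X: np.ndarray, y: np.ndarray, test_size: float = 0.1):
    split = int(len(X) * (1 - test_size))
    # Returned as X_train, X_eval, y_train, y_eval
    return X[:split], X[split:], y[:split], y[split:]
```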
train_model(model, X_train_seq, y_train_seq, epochs=50, batch_size=64)
Train the LSTM model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Model | The compiled LSTM model. | required |
| X_train_seq | ndarray | Training sequence features. | required |
| y_train_seq | ndarray | Training sequence targets. | required |
| epochs | int | Number of epochs to train the model. | 50 |
| batch_size | int | Batch size for training. | 64 |
Returns:
| Name | Type | Description |
|---|---|---|
| history | History | Training history. |
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 91–107
training_pipeline()
Main function to execute the training pipeline for the LSTM model.
Source code in src/prediction_pipeline/modeling/train_lstm.py, lines 120–157
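The pipeline plausibly chains the helpers documented above; a hedged sketch of the control flow, in which the target column names and the window size are invented:

```python
from src.prediction_pipeline.modeling.train_lstm import (
    load_data, split_features_targets, split_train_test,
    create_sequences, build_lstm_model, train_model, save_model,
)

def training_pipeline():
    df = load_data()
    X, y = split_features_targets(df, target_vars=["Falkenstein IN", "Lusen IN"])  # hypothetical targets
    X_train, X_eval, y_train, y_eval = split_train_test(X, y, test_size=0.1)
    X_seq, y_seq = create_sequences(X_train, y_train, window_size=24)  # window size assumed
    model = build_lstm_model(input_shape=X_seq.shape[1:], output_size=y_seq.shape[1])
    train_model(model, X_seq, y_seq, epochs=50, batch_size=64)
    save_model(model)  # defaults to 'lstm_model.h5'
```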
create_uuid()
Creates a unique identifier string.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | A unique identifier string. |
Source code in src/prediction_pipeline/modeling/train_regressor.py, lines 34–42
save_models_to_aws_s3(model, save_path_models, model_name, local_path, uuid)
Save the model to AWS S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | | The model to save. | required |
| save_path_models | str | The path to the model files on AWS S3. | required |
| model_name | str | The name of the model. | required |
| local_path | str | The local path to the model. | required |
| uuid | str | The unique identifier string. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/prediction_pipeline/modeling/train_regressor.py, lines 63–88
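A typical implementation serializes the model locally with joblib and uploads it with boto3, embedding the UUID in the key. A minimal sketch under those assumptions:

```python
import boto3
import joblib

def save_models_to_aws_s3(model, save_path_models: str, model_name: str,
                          local_path: str, uuid: str) -> None:
    local_file = f"{local_path}/{model_name}_{uuid}.joblib"
    joblib.dump(model, local_file)
    # Split "s3://bucket/prefix" into bucket and key prefix
    bucket, _, prefix = save_path_models.removeprefix("s3://").partition("/")
    key = f"{prefix}/{model_name}_{uuid}.joblib"
    boto3.client("s3").upload_file(local_file, bucket, key)
```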
save_predictions_to_aws_s3(df, save_path_predictions, filename, uuid)
Writes an individual CSV file to AWS S3.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to write. | required |
| save_path_predictions | str | The path to the CSV files on AWS S3. | required |
| filename | str | The name of the CSV file. | required |
| uuid | str | The unique identifier string. | required |
Returns:
| Type | Description |
|---|---|
| None | None |
Source code in src/prediction_pipeline/modeling/train_regressor.py, lines 44–61