Discussion and Recommendations
Overview
In this section, we will discuss the limitations of our work, especially those related to data quality, and explain the influence of those limitations on the operationalization of the target variables. We will close with a list of recommendations and future directions.
Limitations
We assumed that different data sources would have varying levels of data quality. Below is an outline of the assumptions made and findings discovered about the data quality of our primary data sources (i.e., visitor sensors, visitor center, weather, and parking data).
Data Quality
Visitor Sensor Data
Sensor data is frequently operationalized as “ground truth” in social and methodological research because it does not suffer from the response errors typically associated with other data types, such as self-reports or observational data recorded by human coders. However, despite this common assumption among methodologists, we recognize that some unknown level of error may still be present in the visitor sensor data. We consider this data to be imperfect for the following reasons:
- Sensors do not identify unique visitors to the park, making them susceptible to double counting, an unknowable source of measurement error in the data. We generally assume this source of error is small.
- Sensor-specific IN and OUT counts may not correspond one-to-one with individual visitors, as some people might enter the park at one sensor and exit at another or leave through an exit without a sensor.
- The limited number of sensors placed at non-random locations throughout the park potentially introduces bias:
  - The absence of sensors at many other entry and exit points systematically underestimates the total number of visitors, making it difficult to accurately assess park occupancy.
  - The locations of the installed sensors do not adequately represent all areas of the park, which introduces bias in the visitor count estimates. For example, a popular trail that loops through the park has two entrances and one exit. The Waldhausreibe sensor is installed at the less frequented of the two entrances. As a result, the IN count for this sensor is notably lower than its OUT count, producing a significant discrepancy between recorded entries and exits. Figures 1 and 2 below illustrate this issue.
- The sensor data exhibit a "missing not at random" pattern (i.e., bias). Since the collection of visitor sensor data began in 2016, additional sensors have been installed over the years, leading to significant gaps in the dataset. Below is a graph (Figure 3) illustrating this issue: individual sensors are shown on the y-axis, while time (ranging from 2016 to 2024) is represented on the x-axis. In the graph, red indicates periods of missing data, while blue indicates when a sensor was installed, functioning, and actively collecting data.
- Furthermore, sensors can malfunction and require repairs, while cloud-connected sensors may go offline, contributing to greater error variance in the estimates of park visitors.
- The sensors count visitors by detecting changes in infrared radiation, which is typically emitted by warm objects, including humans and animals. Although the pyro sensors are calibrated to human body temperature, they may also count other living organisms by mistake.

Figure 1: Waldhausreibe Sensor Discrepancy in Overall Visitor Counts from 2016 to 2024

Figure 2: Distribution of Visitor Counts from Waldhausreibe Sensor by Hour of the Day

Figure 3: Missing Visitor Sensor Data. The y-axis shows individual sensors, and the x-axis represents time (2016–2024). Red indicates missing data, with large blocks indicating pre-installation periods and smaller gaps reflecting intermittent outages. Blue indicates periods when sensors were installed and actively collecting data.
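The gap pattern summarized in Figure 3 can be quantified as the share of missing hours per sensor and year. The following is a minimal sketch, assuming hourly counts are held in a pandas DataFrame with one column per sensor and NaN for hours without data. The sensor names and values are hypothetical and do not come from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts for two sensors over two years (illustrative only).
index = pd.date_range("2016-01-01", "2017-12-31 23:00", freq="h")
counts = pd.DataFrame({"sensor_a IN": 1.0, "sensor_b IN": 1.0}, index=index)

# sensor_b is assumed to be installed in 2017, so all of 2016 is missing.
counts.loc[:"2016-12-31", "sensor_b IN"] = np.nan
# sensor_a is assumed to have a one-week outage in March 2017.
counts.loc["2017-03-01":"2017-03-07", "sensor_a IN"] = np.nan

# Share of missing hours per sensor and calendar year (0.0 = complete, 1.0 = no data).
missing_share = counts.isna().groupby(counts.index.year).mean()
print(missing_share)
```

Large shares close to 1.0 correspond to pre-installation periods, while small non-zero shares point to intermittent outages.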
Visitor Center Data
Because visitors are counted manually and the counts are entered into the data file by hand, we assumed that the visitor center data contain some unknowable amount of human error (i.e., miscounting) as well as clerical errors from data entry. Many of the fields in the visitor center data are easily verifiable and thus fixable, such as the day of the week on which a given date fell, the dates of national holidays, seasons, and whether or not visitor centers were open.
Weather Data
Meteostat data is sourced from weather stations, and we assume the data is collected from reliable sources and is subject to quality control (though local microclimate variations may influence representativeness). Measurement techniques are expected to remain consistent over time, ensuring comparability across future years of deployment of our prediction model and dashboard. The temporal resolution (hourly) fit our analysis needs, and linear interpolation was used in rare cases of missing weather data.
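As an illustration of the gap filling mentioned above, the sketch below applies linear interpolation to an hourly temperature series with pandas. The values are hypothetical, and the use of `limit_area="inside"` (fill only gaps that have valid observations on both sides) is an assumption for this example rather than a documented project setting.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures in degrees Celsius with missing readings.
index = pd.date_range("2024-07-01 00:00", periods=6, freq="h")
temperature = pd.Series([14.0, np.nan, 16.0, np.nan, np.nan, 22.0], index=index)

# Linear interpolation between the nearest valid neighbors.
# limit_area="inside" leaves leading or trailing gaps untouched.
filled = temperature.interpolate(method="linear", limit_area="inside")
print(filled)
```

Because the series is evenly spaced, `method="linear"` and `method="time"` would give the same result here; for irregular timestamps, `method="time"` is usually the safer choice.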
Parking Data
The sensor data used to identify occupied or vacant parking spaces appeared to be reliable insofar as only two of the twelve parking stations suffer from connectivity issues to the Bayern Cloud. Because the sampling interval of the parking sensors varies both within and across sensors, analysis of historic parking data could not be performed, and these data were not usable in the prediction model. However, we assume that the measurements taken by the sensors are generally reliable.
How Data Quality Influenced the Operationalization of our Target Variables
Limitations of Using Sensor Data for Estimating Park Occupancy
The quality of the visitor sensor data limited how the target variable of our prediction models could be operationalized. Ideally, the outcome of interest in our prediction model would be hourly occupancy of the Bavarian Forest National Park; this would allow park management to have an estimate of how many visitors will be in the park at any given hour within a one-week forecast horizon. However, the limited number of visitor sensors produces estimates that are far too low to accurately reflect park occupancy. This target variable was thus not used.
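To make the problem concrete: one way occupancy could in principle be derived from sensor counts is as the running sum of entries minus exits. This is an assumption made for illustration, not a method used in this project. The sketch below uses hypothetical numbers in which many entries go unrecorded, as happens when the busier entrance of a trail has no sensor.

```python
import pandas as pd

# Hypothetical hourly park-wide counts; entries are under-recorded relative to exits.
index = pd.date_range("2024-07-01 08:00", periods=4, freq="h")
flows = pd.DataFrame(
    {"in_count": [10, 6, 2, 0], "out_count": [1, 4, 9, 8]},
    index=index,
)

# Net flow per hour and its running total as a naive occupancy estimate.
flows["net_flow"] = flows["in_count"] - flows["out_count"]
flows["occupancy"] = flows["net_flow"].cumsum()
print(flows)
```

With these inputs the naive estimate ends below zero, which is impossible for a real headcount and shows why incomplete sensor coverage rules out occupancy as a target.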
From Traffic to Distinct Visitor Entries (IN) and Exits (OUT) for Improved Predictions
We initially considered operationalizing the target variable in our prediction models as "traffic." The idea was that summing the IN and OUT columns across sensors at the hourly level would give us a good estimate of park activity. In this case, "traffic" would represent the total number of visitors passing by a sensor, regardless of the direction they were walking.
We also explored using the sum of the IN columns and the sum of the OUT columns across sensors as two distinct target variables. This approach resulted in a lower mean squared error (MSE) than the traffic target variable. Given the improved model performance, we decided to use IN and OUT as separate target variables. This distinction also made sense from a theoretical standpoint: the time lag between people entering (IN) and exiting (OUT) the park justifies treating them separately, especially when making hourly predictions. We applied this approach to produce estimates for the entire park and for each of the six regions defined by the Bavarian Forest National Park.
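The three candidate target variables described above can be sketched as follows. This is a minimal example, assuming each sensor contributes one column ending in " IN" and one ending in " OUT"; the column naming, sensor names, and counts are hypothetical.

```python
import pandas as pd

# Hypothetical hourly counts for two sensors (illustrative values only).
index = pd.date_range("2024-07-01 08:00", periods=3, freq="h")
counts = pd.DataFrame(
    {
        "sensor_a IN": [5, 8, 2],
        "sensor_a OUT": [1, 3, 9],
        "sensor_b IN": [4, 6, 1],
        "sensor_b OUT": [0, 2, 7],
    },
    index=index,
)

# Split the columns by direction using the assumed naming convention.
in_cols = [col for col in counts.columns if col.endswith(" IN")]
out_cols = [col for col in counts.columns if col.endswith(" OUT")]

# Separate IN and OUT targets, plus the combined "traffic" target.
targets = pd.DataFrame(
    {
        "total_in": counts[in_cols].sum(axis=1),
        "total_out": counts[out_cols].sum(axis=1),
    }
)
targets["traffic"] = targets["total_in"] + targets["total_out"]
print(targets)
```

In this toy data the second and third hours have the same traffic value even though one is dominated by entries and the other by exits, which is the information the separate IN and OUT targets preserve.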
Area-level Predictions: Regions vs. Trail Segments
Our project partners at the Bavarian Forest National Park initially sought to estimate park visitors at the trail-segment level, representing the finest spatial granularity for visitor estimates. To inform this approach, we considered using data from Komoot, a navigation app tailored for outdoor enthusiasts that provides maps, route planning, and user-generated content for activities like hiking and cycling.
We believed that leveraging Komoot's crowdsourced data would enable us to apply a distribution to the trail segments within the Bavarian Forest National Park to estimate visitor numbers. However, we ultimately recognized that the app's highly self-selected user base could introduce selection bias into any distribution derived from this data.
To address this issue, we shifted our focus to generating forecasts for different regions of the park, defined by clusters of visitor sensors. While this method does not offer the same level of spatial granularity as trail segments, it is implementable and provides the Bavarian Forest National Park with more detailed predictions than a single park-wide forecast.
Limited Training Data Years Due to Sensor Installation Gaps
Although the Bavarian Forest National Park has been collecting visitor sensor data since 2016, the final number of installed sensors was established in 2023, after which no new sensors will be added. Due to significant gaps in sensor data from previous years, we decided to train our prediction models using visitor counts from sensors that were active starting in 2023 and continuing until the conclusion of our project in July 2024. A greater number of sensors provides better coverage of the park, and this robust data collection allows for more accurate visitor estimates. By focusing on this period, we can enhance the reliability of our predictions and provide the park management with valuable insights for future planning and resource allocation.
Recommendations and Future Directions
A Strategy for Improved Data Collection: Relocating the Waldhausreibe Sensor
One key recommendation for the Bavarian Forest National Park (BFNP) is to relocate the sensor currently positioned at the less-frequented entrance of the popular Waldhausreibe trail to its busier entrance. The current placement results in a significant discrepancy between park entries and exits, as this sensor fails to capture visitors using the more popular entrance. Addressing this issue in future data collection is crucial, as it would help narrow the gap between visitor entries and exits over time.
By reducing this discrepancy, data scientists working on this project in the future could apply calibration or adjustment weights to individual sensors to capture between-sensor variability and importance, potentially allowing the park to estimate hourly occupancy instead of tracking relative flows of incoming and outgoing visitors. Furthermore, capturing incoming visitors at Waldhausreibe would enhance the quality of model training data, leading to improved predictions for overall visitor flows, particularly in the Lusen-Mauth-Finsterau region of the park.
Model Training Experimentation with Select Sensors: Focusing on Pre-2023 Data
Since January 2021, sixteen sensors have been consistently collecting visitor entry and exit data without any interruptions. These sensors are located at: Brechhäuslau, Bucina, Deffernik, Falkenstein, Felswandergebiet, Ferdinandsthal, Fredenbrücke, Gfäll, Lusen, Racheldiensthütte, Scheuereck, Schillerstraße, Schwarzbachbrücke, Trinkwassertalsperre, Waldhausreibe, and Waldspielgelände.
An alternative approach to predicting overall park entries and exits could involve training Extra Trees Regressor models using data from these 16 sensors, rather than all 26 sensors starting from January 1, 2023. By leveraging a smaller but more consistent set of sensors with a longer history, this method may provide more accurate predictions of visitor flows in and out of the Bavarian Forest National Park. Time constraints prevented us from examining this alternative, but it may be promising both for predicting overall visitor flows and for producing predictions at the sensor level instead of the region level.
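Selecting the consistently reporting sensors for such an experiment could look like the sketch below. It assumes hourly counts in a pandas DataFrame with NaN for hours without data; the sensor names, dates of the simulated gaps, and the strict "no missing hours" criterion are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts for three sensors (illustrative values only).
index = pd.date_range("2020-01-01", "2023-12-31 23:00", freq="h")
counts = pd.DataFrame(
    {"sensor_a IN": 1.0, "sensor_b IN": 1.0, "sensor_c IN": 1.0},
    index=index,
)
counts.loc[:"2022-12-31", "sensor_b IN"] = np.nan  # assumed installed in 2023
counts.loc["2021-06-01":"2021-06-14", "sensor_c IN"] = np.nan  # assumed outage

# Restrict to the candidate training window starting January 1, 2021.
window = counts.loc["2021-01-01":]

# Keep only the sensors with no missing hours in that window.
is_complete = window.notna().all()
consistent_sensors = is_complete[is_complete].index.tolist()
training_counts = window[consistent_sensors]
print(consistent_sensors)
```

In practice a small tolerance for short outages (for example, a maximum share of missing hours) may be preferable to the strict criterion used here.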
Experiment with LSTM Neural Network With More Available Training Data in the Future
We compared approximately twenty different machine learning models to forecast visitor flows in the Bavarian Forest National Park using the Python package PyCaret. From this comparison, the Extra Trees Regressor emerged as the top candidate based on several model fit indices.
In addition, we developed a Long Short-Term Memory (LSTM) recurrent neural network. LSTM networks are specifically designed to learn from and retain information in sequences of data, using a unique memory cell structure that enables them to effectively capture long-term dependencies.
While the LSTM models performed similarly to the Extra Trees Regressor, we ultimately chose to implement the Extra Trees Regressor in our final product due to its lower computational demands and ability to perform well with less data. However, we anticipate that as more visitor flow data is collected at the Bavarian Forest National Park, switching to LSTM models may become advantageous, as LSTMs are better equipped to handle large datasets and complex temporal patterns. LSTM and Extra Trees Regressor models should be compared again in the future once more data is available.
Additionally, due to the time constraints of the fellowship and project timeline, we were unable to test LSTM on the model training strategy outlined in the previous section. Applying LSTM to the sixteen visitor sensors that have been collecting data without interruption since January 2021 may yield more accurate predictions for the park overall than those generated by the Extra Trees models, which rely on data collected from 2023 onward.
Proposed Study: GPS Tracking and Survey Data to Monitor Spatial and Temporal Visitor Behavior in Bavarian Forest National Park
A future study could address many limitations of the current project by incorporating GPS tracking technology and survey data to analyze the spatial and temporal behavior of visitors in the Bavarian Forest National Park (BFNP) at a detailed level. Questionnaires could be administered to a random sample of park visitors using Stratified Probability Proportional to Size sampling based on random times and locations within different regions of the park. Survey respondents would report on their behaviors at BFNP, along with demographic and other relevant information. They could also be asked for consent to carry a GPS tracking device that records their movement patterns within the park's trail network for the duration of their visit, and for consent to link their GPS data to their survey responses.
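The site-selection step of such a design could be sketched as follows: regions act as strata, and within each region survey locations are drawn with probability proportional to a size measure. Everything in this example is hypothetical, including the location names, the size measures (for instance, historical sensor counts), the number of draws per stratum, and the choice of sampling with replacement.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical sampling frame: region (stratum) -> location -> size measure.
frame = {
    "region_1": {"entrance_a": 600, "entrance_b": 300, "entrance_c": 100},
    "region_2": {"entrance_d": 250, "entrance_e": 250},
}
draws_per_stratum = 2  # assumed allocation, equal across strata


def draw_pps(sizes, n_draws, rng):
    """Draw locations with probability proportional to size (with replacement)."""
    names = list(sizes)
    weights = np.array([sizes[name] for name in names], dtype=float)
    probabilities = weights / weights.sum()
    picks = rng.choice(len(names), size=n_draws, replace=True, p=probabilities)
    return [names[i] for i in picks]


# Sample independently within each stratum.
sample = {
    region: draw_pps(sizes, draws_per_stratum, rng)
    for region, sizes in frame.items()
}
print(sample)
```

A full design would also randomize the survey times at each selected location and compute design weights from the selection probabilities; both are omitted here for brevity.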
GPS tracking data would be collected over a 12-month period to capture seasonal variations in visitor behavior. This rich dataset could be used to analyze visitor distributions across different trail segments, enabling the park to produce hourly forecasts for specific trail sections. Additionally, the survey data could provide insights into differences between local and tourist visitors, helping BFNP explore how these groups affect the park differently and may require distinct resources.
Currently, the park's sensors cannot distinguish between tourists and local visitors. Future studies could develop classification models to identify these groups more accurately, allowing for more granular predictions. GPS data could also inform the optimal placement of sensors, helping the park detect areas where visitors are missed at entry or exit points. Furthermore, the study could estimate and adjust for double-counting errors at visitor centers, improving the accuracy of future predictive models.