This article is the introductory part of the Big Schedule Optimization series, addressing the concerns of Big Data collection processes.
In most cases, analytical tools assume that the data has already been collected, unified and stored in a database. Business Intelligence (BI) systems deal with structured data that is already prepared for creative intellectual exercises. Although analytics is the main output of Big Data processing, it cannot happen without proper data collection and delivery.
The quality of BI reporting and analysis depends directly on data quality – completeness, timeliness, correctness and validity. This means that the ideal data collection process should guarantee that:
- no data is missed;
- all data is delivered in time;
- all data is identical to the source;
- the source provides absolutely correct, complete and reliable data.
Unfortunately, the real world is not ideal, and these statements should be restated as questions:
- What percentage of missed data, and what size of data gap, is acceptable?
- What timeframe is allowed for the data to be collected?
- What data manipulation is acceptable – how many decimal places to keep? What text can be truncated? Which formatting can be ignored?
- How can the source data be validated? Which alternatives are available? What should be done with invalid data – ignore it, try to correct it, or recover it?
If we define just the percentage of missed data (say 1% for a 10-minute interval series), this means that over 30 days we allow more than 7 hours of ‘silence’, which is definitely not acceptable for many businesses. The exception is the absence of data due to a weekend or holiday, which is not a ‘gap’ but expected unavailability.
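The arithmetic behind that estimate is easy to verify. A short calculation, using the 1% threshold and 10-minute granularity from the example above:

```python
# Worked check of the example above: a 1% missed-data allowance for a
# 10-minute interval series over 30 days.
interval_minutes = 10
days = 30
missed_fraction = 0.01

intervals = days * 24 * 60 // interval_minutes   # 4320 intervals in 30 days
allowed_missed = intervals * missed_fraction     # 43.2 intervals may be missed
silence_hours = allowed_missed * interval_minutes / 60

print(round(silence_hours, 1))  # → 7.2
```

So an innocent-looking 1% translates into more than seven hours of missing data per month.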
This leads us to the next questions: is it possible to retrieve or recover missed data? How and when? How does data recovery affect regular data collection? These questions can be answered only per source, per data set, and per business case. The same data collection might be acceptable for one type of analysis and not acceptable for another.
The ZEMA data management solution provides robust and flexible data collection processes. It manages thousands of data sets from multiple sources all over the world: from real-time reports updated every few seconds to historical records going back up to 20 years and futures reaching more than 20 years ahead; from a single weather station in the middle of the ocean to a power hub that serves several countries.
There are many steps in the data collection process:
- accessing the sources;
- streaming or downloading;
- data processing – format recognition, parsing, conversion, unification;
- validating and confirming (avoiding duplication and redundancy);
- generating metadata for further reference and simplified data access; and
- data storing and archiving.
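A minimal sketch of how those steps could fit together in one processor run. The CSV format, the `value` completeness rule and the hash-based de-duplication here are illustrative assumptions, not a description of any real ZEMA pipeline:

```python
import csv
import hashlib
import io
import json

def collect(raw_csv: str, seen_hashes: set) -> dict:
    """One simplified processor run: parse, validate, de-duplicate,
    and build metadata for the retrieved data set."""
    reader = csv.DictReader(io.StringIO(raw_csv))   # format recognition + parsing
    records, valid = list(reader), []
    for rec in records:
        if not rec.get("value"):                    # validating (toy rule)
            continue
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True).encode()).hexdigest()
        if digest in seen_hashes:                   # avoiding duplication
            continue
        seen_hashes.add(digest)
        valid.append(rec)
    meta = {"records": len(valid), "source_rows": len(records)}  # metadata
    return {"data": valid, "meta": meta}            # ready for storing/archiving

seen = set()
result = collect("ts,value\n2024-01-01T00:00,42\n2024-01-01T00:10,\n", seen)
print(result["meta"])  # → {'records': 1, 'source_rows': 2}
```

The empty-value row is dropped by validation, and a repeated run over the same payload would be filtered out entirely by the shared `seen_hashes` set.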
In this article, we consider the ‘data pulling’ method, where data is collected from external sources. We assume that all the steps above are executed by independent data processors, and that every single request is established and runs properly. All the processors are managed by a centralized scheduler (Data Manager) based on individual settings, data availability and relative prioritization.
The big question is how to optimize the schedule to avoid system conflicts and resource competition, and to minimize the request queue and waiting time, without missing or duplicating data. Once this question is answered, we can also define the minimal requirements for a system that runs effectively without wasting resources in idle mode.
This topic does not include manual (on-demand) data processing execution or real-time request optimization. Instant optimization based on system load or dynamic capacity will be covered in future publications.
The described problem is similar to any dispatching service or traffic control problem. For optimization, we need to know the individual characteristics of each source request, such as:
- Data availability and required collection frequency;
- The earliest/latest shift allowed for each call;
- Processing time – max/min/average;
- Required resources (external connections, database connections, memory, disc space);
- Overlapping/sharing capabilities;
- Cross-impact/restrictions; and
- Potential points of failure/recovery.
In order to ‘fit the spot’ we also need to know:
- Total available resources (memory, disc space, dynamic nodes);
- Maximum number of simultaneous threads;
- Queue capacity; and
- Prioritization rules.
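The two lists above map naturally onto two plain records. A sketch of how they could be modeled – all field names and example values here are illustrative, not ZEMA's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class SourceRequest:
    """Per-request characteristics used by the scheduler (illustrative)."""
    name: str
    frequency_min: int           # required collection frequency, minutes
    earliest_shift_min: int      # earliest allowed shift for each call
    latest_shift_min: int        # latest allowed shift for each call
    avg_proc_min: float          # average processing time (from statistics)
    max_proc_min: float          # worst-case processing time
    resources: set = field(default_factory=set)  # e.g. {"db", "ftp"}
    priority: int = 0

@dataclass
class SystemLimits:
    """Capacity the scheduler must 'fit' the requests into (illustrative)."""
    max_threads: int             # maximum simultaneous threads
    queue_capacity: int
    total_memory_mb: int

req = SourceRequest("weather_station", 10, -2, 5, 1.5, 4.0, {"ftp"}, priority=2)
limits = SystemLimits(max_threads=8, queue_capacity=100, total_memory_mb=4096)
```

With the characteristics and limits in one place, scheduling becomes a constrained packing problem over these records.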
In this article, we consider the Data Availability Schedule as a part of the Big Data Collection Schedule Optimization topic.
The first question for data collection is availability. There is no point in requesting data that has not been published or is unavailable. Therefore, the system should know the data availability and update schedule.
Regularly updated data requires constant attention to processing time. If processing takes too long, the data may be updated during processing and the collected set will be inconsistent. When processing time depends on system load, connection speed or other competing resources, the system should rely on statistics from previous requests rather than fixed pre-defined values.
The scheduler calls the processor at the earliest data availability point to avoid overlapping requests.
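One common way to maintain such statistics is an exponentially weighted average, which tracks recent load changes instead of treating every historical run equally. A minimal sketch (the smoothing factor 0.2 is an arbitrary illustrative choice):

```python
def update_estimate(prev_avg: float, observed: float, alpha: float = 0.2) -> float:
    """Exponentially weighted average of processing time: recent runs
    count more, so the estimate follows changing system load."""
    return (1 - alpha) * prev_avg + alpha * observed

est = 60.0                        # initial guess: 60 seconds
for duration in [55, 80, 90, 95]: # observed durations under rising load
    est = update_estimate(est, duration)
print(round(est, 1))  # → 73.8
```

The estimate drifts upward as the observed durations grow, so the scheduler can reserve a realistic processing slot rather than a stale fixed value.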
With some data sources, a fixed data collection schedule does not work due to uncertainty of data arrival. In this case, the system checks data availability prior to processing it. The question is: how often should the system check for data arrival, so that the data is collected as early as possible without overloading the system with additional requests? The other side of the same question: how often to check so as not to miss the latest processing point, beyond which there is not enough time to process the data before the next arrival.
Data arrival checking alternates between two phases: polling during the expected arrival period until the data has arrived, and idling – there is no need to check availability during processing or until the next expected update period.
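The polling phase can be sketched as a simple schedule of check times, bounded below by the earliest expected arrival and above by the last point that still leaves room to process before the next update. The times and the 10-minute poll interval are illustrative:

```python
from datetime import datetime, timedelta

def check_times(expected_arrival: datetime, latest_start: datetime,
                poll_every: timedelta) -> list:
    """Availability checks only inside the expected-arrival window:
    from the earliest expected arrival up to the latest point that
    still leaves enough time to process before the next arrival."""
    t, times = expected_arrival, []
    while t <= latest_start:
        times.append(t)
        t += poll_every
    return times

arrival = datetime(2024, 1, 1, 8, 0)
# Next update at 09:00 and processing takes ~10 min → last viable start 08:50
checks = check_times(arrival, datetime(2024, 1, 1, 8, 50), timedelta(minutes=10))
print(len(checks))  # → 6 checks: 08:00, 08:10, ..., 08:50
```

Shrinking `poll_every` collects data sooner but multiplies the number of availability requests, which is exactly the trade-off described above.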
Data updating brings a separate challenge: how to distinguish data that has not been updated from data that was updated but not changed? The source does not always provide a correct timestamp or a unique data id. This can only be solved by deep business analysis of the data source. The simplified approach: if the data has not changed during the expected period, it can be treated as an update without change.
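The simplified approach is commonly implemented by fingerprinting the payload: if this cycle's digest matches the previous one within the expected update period, treat it as an update without change. This is a generic technique, not a ZEMA-specific mechanism:

```python
import hashlib
from typing import Optional, Tuple

def check_update(payload: bytes, last_digest: Optional[str]) -> Tuple[bool, str]:
    """Return (changed, digest): compare this cycle's payload digest
    with the previous cycle's digest to detect 'update without change'."""
    digest = hashlib.sha256(payload).hexdigest()
    return digest != last_digest, digest

changed, d1 = check_update(b"price,42", None)   # first sight: treated as changed
changed2, d2 = check_update(b"price,42", d1)    # same bytes: update without change
print(changed, changed2)  # → True False
```

Note that a fingerprint cannot tell "not updated" from "updated without change" – that distinction still requires source-side timestamps or business analysis, as the paragraph above points out.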
In some cases data is only available within a Limited Availability window.
This brings an additional restriction on the latest data processing point: it should be scheduled early enough to complete processing while the data is still available.
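That restriction reduces to a simple deadline calculation: subtract the worst-case processing time from the moment the window closes. The times below are illustrative:

```python
from datetime import datetime, timedelta

def latest_start(window_close: datetime, max_processing: timedelta) -> datetime:
    """Latest point at which processing may begin and still finish
    before the limited-availability window closes."""
    return window_close - max_processing

close = datetime(2024, 1, 1, 12, 0)  # data disappears at noon
deadline = latest_start(close, timedelta(minutes=25))
print(deadline)  # → 2024-01-01 11:35:00
```

Using the maximum rather than the average processing time keeps the deadline safe when the system is under load.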
Summarizing the topic: a robust data collection system should have an adjustable Data Availability Schedule, which consists of the following parts:
- Planned Availability, defined by a Data Analyst or Industry Expert:
  - Data retrieval cycle;
  - Data renewal period;
  - Data availability/update try-out sub-cycle; and
  - Planned data unavailability.
- Data Quality:
  - Data completeness criteria (like number of records, granularity, range);
  - Acceptable missed data criteria; and
  - Acceptable data gap criteria.
- Vetoing rules (stopping / postponing execution on certain criteria);
- Additional data requests if acceptance criteria are not satisfied; and
- Reporting settings on a normal execution and on exceptions.
- Statistics, collected from the previous executions:
  - Average delivery delay against planned availability; and
  - Maximum, minimum and average amount of successfully retrieved data.
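One way such a schedule could be represented is as a plain configuration record. The keys and values below mirror the structure above but are illustrative assumptions, not ZEMA's actual configuration format:

```python
# Illustrative Data Availability Schedule record; keys mirror the
# structure above and are NOT ZEMA's actual configuration format.
schedule = {
    "planned_availability": {
        "retrieval_cycle": "10min",
        "renewal_period": "1day",
        "tryout_subcycle": "2min",
        "planned_unavailability": ["Sat", "Sun"],  # weekends are not 'gaps'
    },
    "data_quality": {
        "completeness": {"min_records": 144, "granularity": "10min"},
        "max_missed_pct": 1.0,
        "max_gap": "30min",
    },
    "vetoing_rules": ["source_maintenance"],       # postpone execution
    "statistics": {"avg_delay_sec": 42, "avg_records": 143},
}

print(schedule["data_quality"]["max_missed_pct"])  # → 1.0
```

Keeping the expert-defined parts and the collected statistics in one record makes the automatic adjustments described below straightforward to apply.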
This information relates only to the data source and allows the system to schedule data retrieval requests properly. It also allows the schedule to be adjusted automatically – for example, shifting the request time if there are constant delays, or changing the cycle if regular vetoes are raised.
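The delay-based adjustment can be sketched in a few lines: when the statistics show a persistent publication delay beyond some tolerance, the scheduled request shifts later by that delay. The 2-minute threshold is an arbitrary illustrative value:

```python
def adjusted_request_minute(planned_minute: int, avg_delay_min: float,
                            threshold_min: float = 2.0) -> int:
    """Shift the scheduled request later when statistics show a
    constant publication delay beyond the tolerance threshold."""
    if avg_delay_min > threshold_min:
        return planned_minute + round(avg_delay_min)
    return planned_minute

print(adjusted_request_minute(0, avg_delay_min=4.6))  # → 5
print(adjusted_request_minute(0, avg_delay_min=1.0))  # → 0
```

Small, statistics-driven corrections like this are what turns a static schedule into the self-learning component described next.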
Adding a self-learning component to the data collection system increases its stability, usability and effectiveness.
More aspects of the Big Schedule Optimization will be covered in upcoming publications.
ZEMA provides effective tools for collecting, processing and analyzing Big Data from any public or private source, any geographical region and any industry. Professionals who rely on ZEMA make their business decisions with maximum confidence.