Big Schedule for Big Data

This article is the introduction to the Big Schedule Optimization series, which addresses the challenges of Big Data collection processes.

In most cases, analytical tools assume that the data is already collected, unified and stored in a database. Business Intelligence (BI) systems deal with structured data that is already prepared for creative intellectual exercises. Although analytics is the main output of Big Data processing, it cannot be produced without proper data collection and delivery.

Data Quality

The quality of BI reporting and analysis depends directly on Data Quality – completeness, timeliness, correctness and validity. This means that the ideal data collection process should guarantee that:

  • no data is missed;
  • all data is delivered in time;
  • all data is identical to the source;
  • the source provides absolutely correct, complete and reliable data.

Unfortunately, the real world is not ideal, and these statements should be interpreted as:

  • What percentage of missed data, and what size of data gap, is acceptable?
  • What timeframe is allowed for the data to be collected?
  • What data manipulation is acceptable – how many decimal places to keep? What text can be truncated? Which formatting can be ignored?
  • How can the source data be validated? Which alternatives are available? What to do with invalid data – ignore it, try to correct it, or recover it?

If we define just the percentage of missed data (let's say 1% for a 10-minute interval series), this means that over 30 days we allow about 7 hours of ‘silence’, which is definitely not acceptable to many businesses. An exception is the absence of data due to a weekend or holiday, which is not a ‘gap’ but expected unavailability.
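The arithmetic behind that figure can be checked in a few lines (the interval length, window and percentage are taken from the example above):

```python
# Sanity check of the example above: a 1% missed-data allowance
# on a 10-minute interval series over 30 days.
INTERVAL_MIN = 10
DAYS = 30

intervals = DAYS * 24 * 60 // INTERVAL_MIN        # 4320 ten-minute points
allowed_missing = intervals * 0.01                # 43.2 points may be missed
allowed_silence_h = allowed_missing * INTERVAL_MIN / 60

print(round(allowed_silence_h, 1))  # 7.2 hours of permitted 'silence'
```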

This leads us to the next questions: is it possible to retrieve or recover missed data? How and when? How does data recovery affect regular data collection? These questions can be answered only per source, per data set and per business case. The same data collection might be acceptable for one type of analysis and unacceptable for another.

The ZEMA data management solution provides robust and flexible data collection processes. It manages thousands of data sets from multiple sources all over the world: from real-time reports updated every few seconds to historical records reaching 20 years back and futures more than 20 years ahead; from a single weather station in the middle of the ocean to a power hub that serves several countries.

Data Manager

There are many steps in the data collection process:

  • accessing the sources;
  • streaming or downloading;
  • data processing – format recognition, parsing, conversion, unification;
  • validating and confirming (avoiding duplication and redundancy);
  • generating metadata for further reference and simplified data access; and
  • data storing and archiving.
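The steps above can be sketched as a minimal pipeline. Everything here is an illustrative placeholder – the function names, the CSV stand-in for a download, and the in-memory "store" are assumptions, not ZEMA APIs:

```python
def download(source: str) -> str:
    # Stand-in for accessing the source and streaming/downloading;
    # returns raw CSV text with one deliberately duplicated row.
    return "ts,value\n2015-06-01T00:00,10\n2015-06-01T00:00,10\n2015-06-01T00:10,12\n"

def parse(raw: str) -> list[dict]:
    # Format recognition, parsing, conversion, unification.
    header, *rows = raw.strip().splitlines()
    keys = header.split(",")
    return [dict(zip(keys, row.split(","))) for row in rows]

def is_valid(rec: dict) -> bool:
    # Validation: a timestamp must be present and the value numeric.
    return bool(rec.get("ts")) and rec.get("value", "").lstrip("-").isdigit()

def deduplicate(records: list[dict]) -> list[dict]:
    # Confirmation: avoid duplication and redundancy.
    seen, out = set(), []
    for r in records:
        key = (r["ts"], r["value"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def collect(source: str) -> dict:
    raw = download(source)
    records = parse(raw)
    valid = [r for r in records if is_valid(r)]
    deduped = deduplicate(valid)
    # Metadata for further reference; storing/archiving is omitted here.
    return {"source": source, "records": len(deduped)}

print(collect("demo"))  # {'source': 'demo', 'records': 2}
```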

In this article, we consider the ‘data pulling’ method, where data is collected from external sources. We assume that all the steps above are executed by independent data processors, each of which is properly configured and runs correctly. All the processors are managed by a centralized scheduler (the Data Manager) based on individual settings, data availability and relative prioritization.

The big question is how to optimize the schedule to avoid system conflicts and resource competition, and to minimize queue length and waiting time, without missing or duplicating data. Once this question is answered, we can also define the minimal requirements for a system that runs effectively without wasting resources in idle mode.

This topic covers neither manual (on-demand) data processing execution nor real-time request optimization. Instant optimization, based on system load or dynamic capacity, will be covered in future publications.

Data Requests

The described problem is similar to any dispatching service or traffic control problem. For optimization, we need to know the individual characteristics of each source request, such as:

  • Data availability and required collection frequency;
  • The earliest/latest shift allowed for each call;
  • Processing time – max/min/average;
  • Required resources (external connections, database connections, memory, disc space);
  • Overlapping/sharing capabilities;
  • Cross-impact/restrictions; and
  • Potential points of failure/recovery.

In order to ‘fit the spot’ we also need to know:

  • Total available resources (memory, disc space, dynamic nodes);
  • Maximum number of simultaneous threads;
  • Queue capacity; and
  • Prioritization rules.
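One way to model the per-request characteristics listed above is a small record type. The field names and the trivial "highest priority first" rule are illustrative assumptions, not a ZEMA schema:

```python
from dataclasses import dataclass

# Per-request profile for the scheduler; fields mirror the lists above.
@dataclass
class RequestProfile:
    source: str
    frequency_min: int           # required collection frequency
    earliest_shift_min: int      # earliest allowed shift for a call
    latest_shift_min: int        # latest allowed shift
    avg_processing_sec: float    # max/min could be tracked the same way
    priority: int = 0            # input to the prioritization rules

requests = [
    RequestProfile("weather_station", 10, -1, 2, 12.5, priority=1),
    RequestProfile("power_hub", 5, 0, 1, 40.0, priority=3),
]
# A trivial prioritization rule: higher priority goes first.
queue = sorted(requests, key=lambda r: -r.priority)
print(queue[0].source)  # power_hub
```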

In this article, we consider the Data Availability Schedule as a part of the Big Data Collection Schedule Optimization topic.

Data Availability

The first question in data collection is availability. There is no point in requesting data that has not yet been published or is unavailable. Therefore, the system should know the data availability and update schedule.

Regularly updated data requires constant attention to the processing time. If it is too long, the data may be updated during processing, and the full set will be inconsistent. When the processing time depends on system load, connection speed or other competing resources, the system should rely on statistics from previous requests rather than on fixed pre-defined values.
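A sketch of estimating processing time from recent runs instead of a fixed value. The window size, the conservative "worst recent case" estimate and the fallback default are assumptions for illustration:

```python
from collections import deque

class ProcessingStats:
    """Rolling window of recent processing times for one request."""

    def __init__(self, window: int = 20):
        self.samples = deque(maxlen=window)  # oldest samples drop out

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def estimate(self, default: float = 60.0) -> float:
        # Plan for the worst recent case; fall back to a default
        # until some history has been collected.
        return max(self.samples) if self.samples else default

stats = ProcessingStats()
for s in (42.0, 55.0, 48.0):
    stats.record(s)
print(stats.estimate())  # 55.0
```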

Regularly updated data

The scheduler calls the processor at the earliest data availability point to avoid overlapping requests.

With some data sources, a fixed data collection schedule does not work due to uncertainty of data arrival. In this case, the system checks data availability prior to processing. The question is: how often should the system check for data arrival so that data is collected as early as possible without overloading the system with additional requests? The other side of the same question: how often should it check so as not to miss the latest processing point, beyond which there won't be enough time to process the data before the next arrival?

Data with uncertain availability

Data arrival checking has two cycles: polling during the expected data arrival period until the data has arrived, and a quiet period afterwards – there is no need to check availability during processing or until the next expected data update period.
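The two-cycle check can be sketched as a function that decides when to look next. Times are minutes from midnight; the window bounds and poll interval are illustrative assumptions:

```python
# Decide when to next check for data arrival: poll inside the expected
# arrival window, stay silent otherwise until tomorrow's window.
def next_check(now: int, window_start: int, window_end: int,
               poll_every: int, arrived: bool) -> int:
    if arrived or now >= window_end:
        return window_start + 24 * 60      # quiet cycle: wait for the next day
    if now < window_start:
        return window_start                # sleep until the window opens
    return now + poll_every                # polling cycle inside the window

print(next_check(now=480, window_start=500, window_end=560,
                 poll_every=5, arrived=False))   # 500 - window not open yet
print(next_check(now=510, window_start=500, window_end=560,
                 poll_every=5, arrived=False))   # 515 - poll again in 5 min
print(next_check(now=520, window_start=500, window_end=560,
                 poll_every=5, arrived=True))    # 1940 - next day's window
```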

Data Updating brings a separate challenge: how to distinguish data that has not been updated from data that was updated but not changed? The source does not always provide a correct timestamp or a unique data id. This can only be solved by deep business analysis of the data source. A simplified approach: if the data has not changed during the expected period, it can be treated as an update without change.
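When the source gives no reliable timestamp or id, one common technique (an assumption here, not necessarily what ZEMA does) is to fingerprint the payload and compare it with the previous run:

```python
import hashlib

# Content hash of a retrieved payload; identical payloads hash identically.
def fingerprint(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

previous = fingerprint(b"ts,value\n00:00,10\n")
latest = fingerprint(b"ts,value\n00:00,10\n")
print(latest == previous)  # True - treat as an update without change
```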

In some cases, data is only available within a Limited Availability window.

Data with limited availability

This brings an additional restriction on the latest data processing point: it should be scheduled so that there is enough time to complete the data processing while the data is still available.
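The restriction reduces to a simple subtraction; the figures and the safety margin below are illustrative assumptions:

```python
# Latest safe processing point inside a limited-availability window:
# processing must finish before the window closes.
def latest_start(window_close: int, expected_processing: int,
                 safety_margin: int = 0) -> int:
    return window_close - expected_processing - safety_margin

# Window closes at minute 600; processing takes ~25 min, keep 5 min spare.
print(latest_start(window_close=600, expected_processing=25, safety_margin=5))  # 570
```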

Availability Schedule

To summarize, a robust data collection system should have an adjustable Data Availability Schedule, which consists of the following parts:

  • Planned Availability, defined by a Data Analyst or Industry Expert
    1. Data retrieval cycle;
    2. Data renewal period;
    3. Data availability/update try-out sub-cycle; and
    4. Planned data unavailability.
  • Data Quality
    1. Data completeness criteria (like number of records, granularity, range);
    2. Acceptable missed data criteria; and
    3. Acceptable data gap criteria.
  • Actions
    1. Vetoing rules (stopping / postponing execution on certain criteria);
    2. Additional data requests if acceptance criteria are not satisfied; and
    3. Reporting settings on a normal execution and on exceptions.
  • Statistics, collected from the previous executions
    1. Average delivery delay against planned availability;
    2. Maximum, minimum and average amount of successfully retrieved data.

This information relates only to the data source and makes it possible to schedule data retrieval requests properly. It also allows the schedule to be adjusted automatically – for example, shifting the request time if there are constant delays, or changing the cycle if vetoes are raised regularly.
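The delay-driven adjustment can be sketched as follows; the threshold and timings are illustrative assumptions:

```python
# Shift the scheduled request time when deliveries are consistently late.
def adjusted_request_time(planned_min: int, recent_delays: list,
                          threshold: float = 2.0) -> int:
    avg_delay = sum(recent_delays) / len(recent_delays)
    # Only react to a sustained delay, not to noise below the threshold.
    return planned_min + round(avg_delay) if avg_delay > threshold else planned_min

print(adjusted_request_time(480, [4.0, 5.0, 3.0]))  # 484 - shift by the ~4-min delay
print(adjusted_request_time(480, [1.0, 0.5, 1.5]))  # 480 - delays are negligible
```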

Adding a self-learning component to the data collection system increases its stability, usability and effectiveness.

More aspects of the Big Schedule Optimization will be covered by upcoming publications.

ZEMA provides effective tools for collecting, processing and analyzing Big Data from any public or private source, any geographical region and any industry. Professionals who rely on ZEMA make their business decisions with maximum confidence.
