By James Henderson

Partners building out IT services are best positioned to capitalise on a big data and analytics (BDA) market set to experience double-digit growth in 2019.

That’s according to new IDC findings, which forecast investment to reach US$189.1 billion globally, an increase of 12 per cent over 2018.

Of note to the channel, IT services will be the largest category within the BDA market this year at $77.5 billion, followed by hardware purchases ($23.7 billion) and business services ($20.7 billion).

Collectively, IT and business services will account for more than half of all BDA revenues until 2022, according to IDC.

“Digital transformation is a key driver of BDA spending with executive-level initiatives resulting in deep assessments of current business practices and demands for better, faster, and more comprehensive access to data and related analytics and insights,” said Dan Vesset, group vice president of IDC.

“Enterprises are rearchitecting to meet these demands and investing in modern technology that will enable them to innovate and remain competitive. BDA solutions are at the heart of many of these investments.”

Meanwhile, Vesset said BDA-related software revenues will be $67.2 billion in 2019, with end-user query, reporting, and analysis tools ($13.6 billion) and relational data warehouse management tools ($12.1 billion) being the two largest software categories.

According to IDC, the BDA technology categories that will see the “fastest revenue growth” will be non-relational analytic data stores (34 per cent) and cognitive/AI software platforms (31.4 per cent).

“Big data technologies can be difficult to deploy and manage in a traditional, on premise environment,” added Jessica Goepfert, program vice president of IDC. “Add to that the exponential growth of data and the complexity and cost of scaling these solutions, and one can envision the organisational challenges and headaches.”

However, Goepfert said cloud can help “mitigate some of these hurdles”.

“Cloud’s promise of agility, scale, and flexibility combined with the incredible insights powered by BDA delivers a one-two punch of business benefits, which are helping to accelerate BDA adoption,” Goepfert explained.

“When we look at the opportunity trends for BDA in the cloud, the top three industries for adoption are professional services, personal and consumer services, and media. All three industries are rife with disruption and have high levels of digitisation potential.

“Additionally, we often find many smaller, innovative firms in this space; firms that appreciate the access to technologies that may have historically been out of reach to them either due to cost or IT complexity.”

By James Henderson

Sourced from ARN

Sourced from Dimensionless

The Next Generation of Data Science

Quite literally, I am stunned.

I have just completed my survey of data (from articles, blogs, white papers, university websites, curated tech websites, and research papers all available online) about predictive analytics.

And I have a reason to believe that we are standing on the brink of a revolution that will transform everything we know about data science and predictive analytics.

But before we go there, you need to know: why the hype about predictive analytics? What is predictive analytics?

Let’s cover that first.

Importance of Predictive Analytics

 


 

According to Wikipedia:

Predictive analytics is an area of statistics that deals with extracting information from data and using it to predict trends and behavior patterns. The enhancement of predictive web analytics calculates statistical probabilities of future events online. Predictive analytics statistical techniques include data modeling, machine learning, AI, deep learning algorithms and data mining.

Predictive analytics is why every business wants data scientists. Analytics is not just about answering questions; it is also about finding the right questions to answer. The applications of this field are so broad that nearly every human endeavour appears in the Wikipedia excerpt below listing where predictive analytics is used:

From Wikipedia:

Predictive analytics is used in actuarial science, marketing, financial services, insurance, telecommunications, retail, travel, mobility, healthcare, child protection, pharmaceuticals, capacity planning, social networking, and a multitude of numerous other fields ranging from the military to online shopping websites, Internet of Things (IoT), and advertising.

In a very real sense, predictive analytics means applying data science models to given scenarios that forecast or generate a score of the likelihood of an event occurring. The data generated today is so voluminous that experts estimate that less than 1% is actually used for analysis, optimization, and prediction. In the case of Big Data, that estimate falls to 0.01% or less.
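To make the idea of a "prediction score" concrete, here is a minimal, purely illustrative sketch in Python: a model is fitted on a handful of historical examples and then emits the likelihood of an event (say, a purchase) for a new case. The feature names and numbers are invented; real systems use far richer data.

```python
# Minimal illustration of a "prediction score": the estimated probability that
# an event (here, a purchase) will occur, learned from historical examples.
# All feature names and data below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical observations: [pages_viewed, minutes_on_site], and whether a purchase followed.
X_train = np.array([[1, 2], [3, 10], [8, 25], [2, 4], [12, 40], [6, 18]])
y_train = np.array([0, 0, 1, 0, 1, 1])

model = LogisticRegression().fit(X_train, y_train)

# Score a new visitor: probability of the "purchase" class.
new_visitor = np.array([[7, 22]])
score = model.predict_proba(new_visitor)[0, 1]
print(f"Likelihood of purchase: {score:.2f}")
```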

Common Example Use-Cases of Predictive Analytics

 


 

A skilled data scientist can use these prediction scores to optimize and massively improve the profit margin of a business or company. For example:

  • If you buy a book for children on the Amazon website, the website identifies that you have an interest in that author and that genre and shows you more books similar to the one you just browsed or purchased.
  • YouTube also has a very similar algorithm behind its video suggestions when you view a particular video. The site (or rather, the analytics algorithms running on it) identifies more videos that you would enjoy watching, based on what you are watching now. In ML, this is called a recommender system (see the sketch after this list).
  • Netflix is another famous example, where recommender systems play a massive role in the suggestions in the ‘shows you may like’ section, and the recommendations are well known for their accuracy in most cases.
  • Google AdWords (the text ads displayed at the top of every Google search) is another example of a machine learning algorithm whose usage can be classified under predictive analytics.
  • Department stores often optimize product placement so that commonly bought groups of items are easy to find. For example, the fresh fruits and vegetables would be close to the health food supplements and diet-control foods that weight-watchers commonly use. Coffee/tea/milk and biscuits/rusks make another possible grouping. You might think this is trivial, but department stores have recorded up to a 20% increase in sales when such optimal grouping and placement was performed – again, through a form of analytics.
  • Bank loans and home loans are often approved based on a customer’s credit score. How is that calculated? An expert system of rules, classification, and extrapolation of existing patterns – you guessed it – predictive analytics.
  • Allocating budgets in a company to maximize the total profit in the upcoming year is predictive analytics. This is simple at a startup, but imagine the situation in a company like Google, with thousands of departments and employees, all clamoring for funding. Predictive Analytics is the way to go in this case as well.
  • IoT (Internet of Things) smart devices are one of the most promising applications of predictive analytics. It will not be long before sensor data from aircraft parts is used to tell operators that a part has a high likelihood of failure. Ditto for cars, refrigerators, military equipment, military infrastructure and aircraft – anything that uses IoT (which is nearly every embedded processing device available in the 21st century).
  • Fraud detection, malware detection, hacker intrusion detection, cryptocurrency hacking, and cryptocurrency theft are all ideal use cases for predictive analytics. Here, the ML system detects anomalous behavior on an interface used by hackers and cybercriminals, identifying when a theft or fraud has taken place, is taking place, or will take place. Obviously, this is a dream come true for law enforcement agencies.
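As a deliberately tiny illustration of the recommender systems mentioned in the list above, the sketch below computes item-to-item cosine similarity from a user-item ratings matrix, one of the simplest collaborative-filtering approaches. The matrix and item names are made up; production systems at Amazon, YouTube or Netflix are vastly larger and more sophisticated.

```python
# Item-to-item collaborative filtering in miniature: recommend items similar
# (by cosine similarity of their rating columns) to one the user just viewed.
# The ratings matrix and item names are purely illustrative.
import numpy as np

items = ["kids_book_A", "kids_book_B", "thriller_C", "cookbook_D"]
# Rows = users, columns = items, values = ratings (0 means "not rated").
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def similar_items(item, top_n=2):
    idx = items.index(item)
    scores = [(other, cosine_sim(ratings[:, idx], ratings[:, j]))
              for j, other in enumerate(items) if j != idx]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]

print(similar_items("kids_book_A"))   # e.g. [('kids_book_B', ...), ('cookbook_D', ...)]
```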

So now you know what predictive analytics is and what it can do. Now let’s come to the revolutionary new technology.

Meet Endor – The ‘Social Physics’ Phenomenon

 


End-to-End Predictive Analytics Product – for non-tech users!

 

In a remarkable first, a research team at MIT in the USA has created a new science called social physics, or sociophysics. Much about this field is deliberately kept highly confidential because of its massive disruptive power as far as data science, and especially predictive analytics, is concerned. The only requirement of this science is that the system being modeled must be a human-interaction-based environment. To keep the discussion simple, we shall explain the entire system in points.

  • All systems in which human beings are involved follow scientific laws.
  • These laws have been identified, verified experimentally and derived scientifically.
  • By laws we mean equations, such as (just one example) Newton’s second law: F = ma (force equals mass times acceleration).
  • These equations establish laws of invariance – relationships that stay the same regardless of which human-interaction system is being modeled.
  • Hence the term social physics – like Maxwell’s laws of electromagnetism or Newton’s theory of gravitation, these laws are a new discovery that is universal as long as the agents interacting in the system are humans.
  • The invariance and universality of these laws have two important consequences:
    1. The need for large amounts of data disappears – Because of the laws, many of the predictive capacities of the model can be obtained with a minimal amount of data. Hence small companies now have the power to use analytics that was previously the preserve of the FAMGA (Facebook, Amazon, Microsoft, Google, Apple) companies, since they were the only ones with the money to maintain Big Data warehouses and data lakes.
    2. There is no need for data cleaning. Since the model being used is canonical, it is independent of data problems like outliers, missing data, nonsense data, unavailable data, and data corruption. This is due to the orthogonality between the model being constructed (a Knowledge Sphere) and the data available.
  • Performance is superior to deep learning models built with Google TensorFlow, PyTorch, or scikit-learn in Python, R, or Julia. Consistently, the model has outscored the latter in Kaggle competitions, without any data pre-processing or data preparation and cleansing!
  • Data being orthogonal to interpretation and manipulation means that encrypted data can be used as-is. There is no need to decrypt encrypted data to perform a data science task or experiment. This is significant because a model that functions even on encrypted data opens the door to blockchain technology and blockchain data being used in standard data science tasks. Furthermore, this allows hashing techniques to be used to hide confidential data and to perform the data mining task without any knowledge of what the data indicates.

Are You Serious?


That’s a valid question, given these claims! And that is why I recommend that everyone with even the slightest interest in data science visit and completely read and explore the following links:

  1. https://www.endor.com
  2. https://www.endor.com/white-paper
  3. http://socialphysics.media.mit.edu/
  4. https://en.wikipedia.org/wiki/Social_physics

Now when I say completely read, I mean completely read. Visit every section and read every bit of text available on the four sites above. You will soon understand why this is such a revolutionary idea.

  1. https://ssir.org/book_reviews/entry/going_with_the_idea_flow#
  2. https://www.datanami.com/2014/05/21/social-physics-harnesses-big-data-predict-human-behavior/

These links above are articles about the social physics book and about the science of sociophysics in general.

For more details, please visit the following articles on Medium. These further document Endor.coin, a cryptocurrency built around the idea of sharing data with the public and getting paid for the system’s usage of your data. Preferably read all of them; if busy, at least read article no. 1.

  1. https://medium.com/endor/ama-session-with-prof-alex-sandy-pentland
  2. https://medium.com/endor/endor-token-distribution
  3. https://medium.com/endor/https-medium-com-endor-paradigm-shift-ai-predictive-analytics
  4. https://medium.com/endor/unleash-the-power-of-your-data

Operation of the Endor System

On every data set, the first action performed by the Endor Analytics Platform is clustering, also popularly known as automatic classification. Endor constructs what is known as a Knowledge Sphere, a canonical representation of the data set that can be built with as little as 10% of the data volume that would be needed for the same project using deep learning.

Creation of the Knowledge Sphere takes 1–4 hours for a dataset of a billion records (which is pretty standard these days).
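Endor does not publish the details of its clustering method, so the sketch below is only a generic stand-in: it shows what "automatic classification" of behavioural records looks like in code, using scikit-learn's KMeans on invented per-user features. It is an illustration of the task, not of Endor's algorithm.

```python
# Generic illustration of automatic clustering ("automatic classification") of
# user behaviour. This is NOT Endor's proprietary algorithm; it only shows the
# shape of the task: group users by behavioural features, unsupervised.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical per-user features: [trades_per_day, avg_trade_size, active_days_last_week]
casual = rng.normal([2, 100, 3], [1, 30, 1], size=(500, 3))     # simulated casual traders
heavy = rng.normal([20, 1500, 7], [5, 400, 1], size=(500, 3))   # simulated heavy traders

X = StandardScaler().fit_transform(np.vstack([casual, heavy]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))   # sizes of the two behavioural clusters found
```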

An explanation of the mathematics behind social physics is beyond our scope, but I will include the change in the data science process when the Endor platform is compared with a deep learning system built to solve the same problem the traditional way (by an expert data scientist on a six-figure salary).

An edited excerpt from https://www.endor.com/white-paper:

From Appendix A: Social Physics Explained, Section 3.1, pages 28-34 (some material not included):

Prediction Demonstration using the Endor System:

Data:
The data that was used in this example originated from a retail financial investment platform and contained the entire investment transactions of members of an investment community. The data was anonymized and made public for research purposes at MIT (the data can be shared upon request).

 

Summary of the dataset:
– 7 days of data
– 3,719,023 rows
– 178,266 unique users

 

Automatic Clusters Extraction:
Upon first analysis of the data, the Endor system detects and extracts “behavioral clusters” – groups of users whose data dynamics violate the mathematical invariances of social physics. These clusters are based on all the columns of the data, but are limited to the last 7 days, as this is the data that was provided to the system as input.

 

Behavioural Clusters Summary

Number of clusters: 268,218
Cluster sizes: 62 (mean), 15 (median), 52,508 (max), 5 (min)
Clusters per user: 164 (mean), 118 (median), 703 (max), 2 (min)
Users in clusters: 102,770 out of the 178,266 users
Records per user: 33 (mean), 6 (median) – applies only to users in clusters

 

Prediction Queries
The following prediction queries were defined:
1. New users to become “whales”: users who joined in the last 2 weeks who will generate at least $500 in commission in the next 90 days (see the sketch after this list for how such a label might be constructed).
2. Reducing activity: users who were active in the last week who will reduce their activity by 50% in the next 30 days (but will not churn, and will continue trading).
3. Churn in “whales”: currently active “whales” (as defined by their activity during the last 90 days), who were active in the past week, who will become inactive for the next 30 days.
4. Will trade in Apple shares for the first time: users who had never invested in Apple shares and will buy them for the first time in the coming 30 days.
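To show what the first query means in practice, here is a hypothetical pandas sketch that builds a "new whale" label from a raw transactions table. The column names (user_id, signup_date, trade_date, commission) are assumptions of mine; the white paper does not specify a schema.

```python
# Hypothetical construction of the "new users to become whales" label:
# users who joined in the last 2 weeks and generate >= $500 commission
# in the following 90 days. Column names are invented for illustration.
import pandas as pd

def label_new_whales(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    """transactions columns assumed: user_id, signup_date, trade_date, commission."""
    recent_joiners = transactions.loc[
        transactions["signup_date"] >= as_of - pd.Timedelta(days=14), "user_id"].unique()

    future = transactions[
        (transactions["trade_date"] > as_of) &
        (transactions["trade_date"] <= as_of + pd.Timedelta(days=90))]
    future_commission = future.groupby("user_id")["commission"].sum()

    commission = future_commission.reindex(recent_joiners, fill_value=0.0)
    return (commission >= 500).rename("is_new_whale")
```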

 

Knowledge Sphere Manifestation of Queries
It is again important to note that the definition of the search queries is completely orthogonal to the extraction of behavioral clusters and the generation of the Knowledge Sphere, which was done independently of the queries’ definition.

Therefore, it is interesting to analyze the manifestation of the queries in the clusters detected by the system: Do the clusters contain information that is relevant to the definition of the queries, despite the fact that:

1. The clusters were extracted in a fully automatic way, using no semantic information about the data, and

2. The queries were defined after the clusters were extracted, and did not affect this process.

This analysis is done by measuring the number of clusters that contain a very high concentration of “samples”; in other words, by looking for clusters that contain “many more examples than statistically expected”.

A high number of such clusters (provided that it is significantly higher than the number obtained when randomly sampling the same population) proves the ability of this process to extract valuable, relevant semantic insights in a fully automatic way.
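"Many more examples than statistically expected" can be read as a standard over-representation (enrichment) test. The sketch below is my interpretation rather than Endor's actual implementation: it uses the hypergeometric distribution to ask how unlikely it is to see that many query-matching users in a cluster by chance. The positive counts in the example call are invented; the population and cluster sizes come from the excerpt above.

```python
# One way to read "many more examples than statistically expected": a
# hypergeometric enrichment test per cluster. This is an interpretation of the
# white paper's description, not Endor's published implementation.
from scipy.stats import hypergeom

def cluster_enrichment_pvalue(cluster_size: int,
                              positives_in_cluster: int,
                              population_size: int,
                              positives_in_population: int) -> float:
    """P(seeing >= positives_in_cluster positives in a random sample of cluster_size)."""
    return hypergeom.sf(positives_in_cluster - 1,
                        population_size,
                        positives_in_population,
                        cluster_size)

# Hypothetical example: 40 of a cluster's 62 users match the query, while only
# 5,000 of the 178,266 users match overall (both positive counts are made up).
print(cluster_enrichment_pvalue(62, 40, 178_266, 5_000))
```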

 

Comparison to Google TensorFlow

In this section, a comparison between the prediction process of the Endor system and Google’s TensorFlow is presented. It is important to note that TensorFlow, like any other deep learning library, faces some difficulties when dealing with data similar to that under discussion:

1. An extremely uneven distribution of the number of records per user requires some canonization of the data, which in turn requires:
   – some manual work, done by an individual who has at least some understanding of data science, and
   – some understanding of the semantics of the data, which requires an investment of time as well as access to the owner or provider of the data.

2. A single-class classification, using an extremely uneven distribution of positive vs. negative samples, tends to lead to overfitting of the results and requires some non-trivial maneuvering.

This again necessitates the involvement of an expert in deep learning (unlike the Endor system, which can be used by business, product or marketing experts, with no prerequisites in machine learning or data science).

 

Traditional Methods

An expert in deep learning, with sufficient expertise to handle the data, spent 2 weeks crafting a TensorFlow-based solution. The solution that was created used the following auxiliary techniques (see the sketch after this list):

1. Trimming the data sequence to 200 records per customer, and padding the streams of users who have fewer than 200 records with neutral records.

2. Creating 200 training sets, each having 1,000 customers (50% known positive labels, 50% unknown), and then using these training sets to train the model.

3. Using sequence classification (an RNN with 128 LSTMs) with 2 output neurons (positive, negative), with the overall result being the difference between the scores of the two.
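For readers who want to picture that baseline, here is a hedged Keras sketch of the kind of model the excerpt describes: sequences trimmed or padded to 200 records, an RNN with 128 LSTM units, two output neurons, and a score taken as the difference between the two outputs. The number of features per record and all training details are my assumptions; the white paper gives only the outline above.

```python
# Sketch of the TensorFlow baseline described above: sequences of up to 200
# records per customer, 128 LSTM units, 2 output neurons, score = positive
# logit minus negative logit. The 16 features per record is an assumption.
import numpy as np
import tensorflow as tf

SEQ_LEN, N_FEATURES = 200, 16

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(SEQ_LEN, N_FEATURES)),  # ignore padding
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(2),            # (positive, negative) logits
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

def prediction_score(padded_sequences: np.ndarray) -> np.ndarray:
    """Overall score = positive logit minus negative logit, as in the excerpt."""
    logits = model.predict(padded_sequences, verbose=0)
    return logits[:, 0] - logits[:, 1]
```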

Observations (all statistics are available in the white paper – and they are stunning):

1. Endor outperforms TensorFlow in 3 out of 4 queries, and achieves the same accuracy in the 4th.

2. The superiority of Endor becomes increasingly evident as the task becomes “more difficult” – focusing on the top 100 rather than the top 500.

3. There is a clear distinction between the “less dynamic” queries (becoming a whale, churn, reduced activity – for which static signals should be easier to detect) and the “who will trade in Apple for the first time” query, which is (a) more dynamic and (b) has a very low baseline; for the latter, Endor is 10x more accurate!

4. As previously mentioned, the TensorFlow results illustrated here employ 2 weeks of manual improvements by a deep learning expert, whereas the Endor results are 100% automatic, and the entire prediction process in Endor took 4 hours.

Clearly, the path going forward for predictive analytics and data science is Endor, Endor, and Endor again!

Predictions for the Future

Personally, one thing has me sold – the robustness of the Endor system in handling noise and missing data. This has long been the biggest bane of the data scientist in most companies (when data engineers are not available): 90% of a professional data scientist’s time would go into data cleaning and preprocessing, since our ML models were acutely sensitive to noise. This is the first solution that has eliminated this ‘grunt’ level work from data science completely.

The second prediction: the Endor system works on principles of human interaction dynamics. My intuition tells me that data collected at random has its own dynamical structure, which appears clearly to experts in complexity theory. I am completely certain that just as this tool built a prediction engine from the dynamical laws of human society, data collected in general has its own laws of invariance. And the first person to identify those laws and build another Endor-style platform on them will be at the top of the data science pyramid – the alpha unicorn.

Final prediction: democratizing data science means that data scientists on six-figure salaries are no longer required. The success of the Endor platform means that anyone can perform advanced data science without resorting to TensorFlow, Python, R, Anaconda, and so on. This platform will completely disrupt the data science technology sector. The first people to master it, and to build upon it to formalize the rules of invariance for general data dynamics, will surely make a killing.

It is an exciting time to be a data science researcher!

Data science is a broad field, and mastering all of these skills requires learning quite a few things.

Dimensionless has several resources to get started with.

Sourced from Dimensionless

By Vadim Revzin and Sergei Revzin

If you were entering the job market in the early 90s, most job descriptions included “Macintosh experience” or “excellent PC skills” in their preferred qualifications. This quickly became a requirement for even the most non-technical jobs, forcing people across every industry and age group to adapt to the changing times or risk getting left behind.

Today, the bar for computer proficiency is set much higher. There’s an ever-increasing demand for people who can leverage software to analyze, understand, and make day-to-day business decisions based on data. Data Science is now a quickly growing discipline, giving people with any kind of data expertise a serious competitive edge.

Corporate leaders are becoming convinced of the impact that effective data collection and analysis can have on the bottom line, from tracking daily reports against Key Performance Indicators to make informed decisions on where to spend marketing dollars, to monitoring and evaluating customer communications to adjust product offerings. Many are investing heavily in hiring talent with data skills and building out data proficiency across the organization.

If you see this as an important step in the evolution of your business, there’s a lot you can do to improve data skills among existing employees without spending a ton of money on expensive consultants or full-time data experts. This all starts with thinking carefully about how employees are motivated, and how you can have the right reward systems in place to achieve your desired goal.

Five years ago, Jack Welch famously stated that there are three fundamental ways to motivate employees: financial rewards, recognition, and a clear mission. But in contrast to Welch’s 41-year tenure at GE, today’s employees are expected to hold an average of 10 jobs before the age of 40. Because of this, a fourth motivational principle must be added: personal growth and development.

How can each of these principles be applied to building data skills across teams?

To answer that question, we need to start with the basics. Creating any kind of cultural transformation requires a long-term commitment, and that expectation should be set from the start across the various stakeholders interested in bringing the organization into the data-driven era. With that said, if you take the right steps early on, you can set yourself up for success in the future, and this starts with:

Aligning the company towards the new mission

Since this is first and foremost the responsibility of leadership, early executive buy-in on becoming a more data-driven company is paramount. Getting teams and individual contributors to form new habits comes down to leading by example. As is so often the case, the smallest changes can have the biggest impact.

Take your weekly Monday morning all hands meeting — an opportunity to share important updates, clarify short-term goals, and motivate the team to keep pushing forward toward the main vision. This is the perfect chance to change the way you communicate to better highlight your changing strategy.

Has the company decided to pursue a new business vertical based on data collected by the sales team in the field? Take this opportunity to educate other teams in the organization by clarifying how the team was able to successfully leverage data to validate the demand in this new vertical — from setting up customer interviews, to tracking responses in a spreadsheet and reviewing them as a team.

Just taking this one step can motivate others in the company to start thinking about ways that they can do the same thing in their own roles — after all, the sales team must be doing something right to be singled out during the all hands meeting.

You can also encourage team leads and managers to be more deliberate about highlighting successful outcomes from using data.

If a sales manager has been tracking the performance of sales efforts against a new vertical, he should be able to quickly gather some valuable insights that the rest of the organization would benefit from understanding. A clear example of how using data is already starting to drive more revenue for the organization might be: “Over the last week, after selecting two of our leading sales reps to focus on pitching this new customer segment, we noticed that the time to close a new customer went down from five days to two days, with the average contract size increasing by $500.”

Highlighting wins like this does a few things. It builds trust from employees that can now clearly see that the company is deliberate in how it makes important decisions. It also motivates colleagues to emulate their peers to have an opportunity to be mentioned by leadership in the next all hands meeting.

Leadership should encourage various department heads to take a similar approach in their communication. Any meeting in front of the whole team can be used to share takeaways. Perhaps news recently came out about a competitor that was able to take advantage of a new tool to optimize their marketing funnel. Share these case studies with the team to encourage them to think about how a process change or new tool might be able to help with their job.

Another way to make data top of mind is to display it all over the office. Install a TV showing a few data dashboards. Is real-time web traffic an important metric for the team to keep an eye on? Load up a dashboard from Google Analytics and have it always running. People will start to notice trends, like when traffic spikes during the day or when social media activity is at its peak, and can then have impromptu brainstorming discussions around how things can be improved.

As people start to understand the importance of thinking through the lens of data, some employees will display a personal desire to learn new data skills. Growing as a professional and learning new hard skills has been proven to lead to more job satisfaction, which is why one of the best ways to incentivize employees is to create opportunities for professional development.

Focusing on people’s personal growth

Google famously focused on employees’ personal growth with their 20% rule, where employees were allowed to spend 20% of their time working on personal projects. Similarly, you can work with your managers to create a culture across the organization where spending time on self-study around acquiring data skills is encouraged.

Ask people to consume relevant content about how data can be used in their roles, and use your internal chat app to share interesting and relevant articles that employees find throughout the week.

If someone takes an interest in diving deep into a particular solution, like Google Analytics or Mixpanel, give them time during the week to become certified in those tools. You can also give managers the freedom to approve inexpensive online seminars and courses for those who are interested, proving that the company truly cares about investing in its talent. If you want to go the extra mile, you can even offer to cover the cost for anyone interested in taking evening or weekend classes around topics like data science.

As expertise grows, certain team members will take more initiative and start to stand out from the rest. This is a perfect opportunity to allow employees to learn from each other.

If you see that someone excels at manipulating data to provide new insights, give them a platform to train their peers, encouraging knowledge sharing within teams and across departments. People will appreciate the ability to take on new responsibilities like this and feel positive about being seen as a domain expert.

Another powerful way to impact how employees feel and where they place extra effort is to offer individual recognition.

Private and public recognition

It’s important to embed a practice where people are consistently recognized for great work, and this is very simple to do through existing channels.

Train managers to focus on providing private recognition to exceptional employees. Many sales teams integrate weekly 1:1 meetings between managers and employees to identify challenges, and offer help. This could be a great time to spend the first few minutes of each meeting congratulating someone if they are particularly diligent with tracking data, or submitting critical reports on time.

Create a “Data Expert of the Week” award, where you share success stories from specific employees during a standing meeting, via email, or in your favorite chat app in front of the whole company. You can even offer a little financial incentive by asking people to nominate someone else on the team, offering a $200 gift card to the person that gets the most votes. This helps people feel appreciated by their peers, and provides a little extra monetary motivation.

If possible, you can also offer extra quarterly or annual bonuses for those who are truly transforming the way certain things are done. If your engineering team dedicated time to implement a new solution that tracks additional user metrics within your application, and this information becomes critical to your understanding of your customers, reward them by giving people on that team bonuses, signaling to employees on other teams to take similar initiative.

The use cases for leveraging data to build value are limitless, so it’s inevitable that data will continue to become a bigger part of our day-to-day work. Jack Welch gave us a brilliant blueprint for motivating people to do great things. Incentivizing your team to become more data savvy is just one way you can achieve this greatness.

By Vadim Revzin and Sergei Revzin

Vadim Revzin is an Entrepreneur in Residence at GenFKD, a national non-profit, where he teaches entrepreneurship at State University of New York and is the co-host of a weekly podcast called The Mentors featuring stories from successful founders and creators. He’s advised hundreds of startups, and has been both a founder and leader across several early and growth stage startups.


Sergei Revzin is a Venture Investor at the NYU Innovation Venture Fund where he leads the university’s technology investments and is the co-host of The Mentors podcast with his twin brother Vadim. He has mentored hundreds of entrepreneurs all over the country through his work with Venture for America, and has been an early employee and founder at tech companies in NYC and Boston.

Sourced from Harvard Business Review


After years of holding its data close to the vest, Google has begun to give advertisers more data to help them make better decisions and run successful campaigns. Earlier this month, Google confirmed that it would run a small-scale rollout of an Insights analytics report in Google My Business that shows business owners the most popular search keywords that people use to find listings.

On Friday Google announced that the Search Analytics API found in the Search Console now allows advertisers to retrieve 25,000 rows of data per request, up from 5,000 rows previously. Marketers can query all their search analytics data without exceeding their quota by running a daily query for one day’s worth of data.

Marketers need to choose the information requested, such as the search type — web, image or video — along with dimensions such as page, query, country, or device, and whether to group results by page or property.
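As a rough illustration of what such a request looks like, here is a hedged Python sketch using the google-api-python-client against the Search Console Search Analytics API. The property URL, dates and credentials file are placeholders, and access assumes the account has been granted permission on the property; the field names follow my reading of the API documentation rather than anything stated in the article.

```python
# Hedged sketch: pull one day of Search Analytics data with the enlarged
# 25,000-row limit. Property URL, dates and credentials path are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",                      # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"])
service = build("webmasters", "v3", credentials=creds)   # Search Console API

response = service.searchanalytics().query(
    siteUrl="https://www.example.com/",          # hypothetical verified property
    body={
        "startDate": "2018-06-01",               # one day's worth of data per query
        "endDate": "2018-06-01",
        "searchType": "web",                     # web, image or video
        "dimensions": ["query", "page", "country", "device"],
        "rowLimit": 25000,                       # raised from 5,000
        "startRow": 0,
    },
).execute()

for row in response.get("rows", []):
    print(row["keys"], row["clicks"], row["impressions"], row["ctr"], row["position"])
```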

Along with the news, Google published a guide to take marketers through data retrieval. It includes an overview and describes how to group results by page or property and the dos and don’ts for the process, as well as defaults and nuances of how the queries work.

Google also notes that impressions, clicks, position, and click-through rates are calculated differently when grouping results by page rather than by property.

Earlier this week, Google announced the integration of Hotel Ads into the Google Ads platform with the introduction of a new type of campaign and a new dashboard for managing hotel price feeds.

Although Hotel Ads have been around for about eight years — initially in sponsored listings in Google Maps and then in Google Search — they were managed in a separate ad platform.

Now all the data resides in one place. Overall, it means marketers gain more data from one dashboard to support campaigns across the board.


Sourced from MediaPost

Researchers need to be aware of the mistakes that can be made when mining social-media data.

By MediaStreet Staff Writers

A growing number of academic researchers are mining social media data to learn about both online and offline human behaviour. In recent years, studies have claimed the ability to predict everything from summer blockbusters to fluctuations in the stock market.

But mounting evidence of flaws in many of these studies points to a need for researchers to be wary of serious pitfalls that arise when working with huge social media data sets. That is according to computer scientists at McGill University in Montreal and Carnegie Mellon University in Pittsburgh.

Such erroneous results can have huge implications: thousands of research papers each year are now based on data gleaned from social media. “Many of these papers are used to inform and justify decisions and investments among the public and in industry and government,” says Derek Ruths, an assistant professor in McGill’s School of Computer Science.

Ruths and Jürgen Pfeffer of Carnegie Mellon’s Institute for Software Research highlight several issues involved in using social media data sets – along with strategies to address them. Among the challenges:

  • Different social media platforms attract different users – Pinterest, for example, is dominated by females aged 25-34 – yet researchers rarely correct for the distorted picture these populations can produce.
  • Publicly-available data feeds used in social media research don’t always provide an accurate representation of the platform’s overall data – and researchers are generally in the dark about when and how social media providers filter their data streams.
  • The design of social media platforms can dictate how users behave and, therefore, what behaviour can be measured. For instance, on Facebook the absence of a “dislike” button makes negative responses to content harder to detect than positive “likes.”
  • Large numbers of spammers and bots, which masquerade as normal users on social media, get mistakenly incorporated into many measurements and predictions of human behaviour.
  • Researchers often report results for groups of easy-to-classify users, topics, and events, making new methods seem more accurate than they actually are. For instance, efforts to infer political orientation of Twitter users achieve barely 65% accuracy for typical users – even though studies (focusing on politically active users) have claimed 90% accuracy.

Many of these problems have well-known solutions from other fields such as epidemiology, statistics, and machine learning, Ruths and Pfeffer write. “The common thread in all these issues is the need for researchers to be more acutely aware of what they’re actually analysing when working with social media data,” Ruths says.

Social scientists have honed their techniques and standards to deal with this sort of challenge before.

The infamous ‘Dewey Defeats Truman’ headline of 1948 stemmed from telephone surveys that under-sampled Truman supporters in the general population. Rather than permanently discrediting the practice of polling, that glaring error led to today’s more sophisticated techniques, higher standards, and more accurate polls. Says Ruths, “Now, we’re poised at a similar technological inflection point. By tackling the issues we face, we’ll be able to realise the tremendous potential for good promised by social media-based research.”

 

 

By Mark Eggleton

Data has been called the oil, or even the soil, of the 21st century. Either analogy is apt, as both illustrate the importance of data and how it will power or feed the world economy in the digital age. Data is the fuel of the future as well as the rich soil in which everything will grow.

It is already ubiquitous, ranging from the consumer data story – knowledge of our online search history, financial transactions and social media interactions – to wider uses: data drives the analysis of whole industries, small businesses and traffic patterns.

It has moved far beyond crunching the numbers to find out your age, gender and income to gaining real time insights into how the world works and (hopefully) how it can be improved.

A good example would be the humble automobile. There are millions of cars on the road yet most of them sit idle over 95 per cent of the time. Better use of data should reduce the number of vehicles on the roads and increase the utilisation rate of cars actually on the roads. Drive share programs and driverless vehicles will mean fewer cars are sold but the data generated by each one as well as each user and their travel patterns will be invaluable.

Yet while every technological soothsayer suggests all of this is just around the corner, there’s still an extraordinary amount we don’t know as the new economy takes shape.

NSW Chief Data Scientist and CEO of the NSW Data Analytics Centre Dr Ian Oppermann is a little more sanguine about where we are now.

“The way companies have been taking advantage of a new digital future is better targeted services, more individual services. Quite often we talk about know your customer or a market size of one.

“For government, we’re helping agencies re-think the delivery of services. It’s doing old things in new ways where you can actually look across barriers, across agency boundaries and silos.”

Oppermann warns organisations should not fall into the trap of making decisions based solely on data as it’s a very simplistic observation of the world.

“What we do with data analytics or with artificial intelligence is we try to recapture the information in that data and then make informed decisions based on the little pieces of information scattered throughout many different sources.” (Dr Ian Oppermann, NSW Chief Data Scientist and CEO of the NSW Data Analytics Centre)

He cautions against blind faith in algorithms or in data “as a dangerous place to be”.

“It’s like following the GPS down a goat track even though when you look outside into the real world you realise you really should not be driving down that road. As long as we’re aware of the risk and as long as we question the results then I think we stand ourselves in good stead to make better decisions.

“What we ultimately want is to help people use more data from a variety of sources to assist them in making better decisions, but we also have to build trust.”

Oppermann says trust comes with so many different aspects and it’s an evolving journey. He cites what we already do with banks as an example of how far along the journey we already are when it comes to digital trust.

“The data which a bank holds, is our salaries, it’s our pay – it’s something we can translate to cash but realistically banks are data centres or trust centres. We are quite comfortable having our salary paid into a bank where it goes in as data, sits there as data and we draw it out as data until we use an ATM. All the while it’s just data until we get it to manifest as a polymer bank note.

“We trust it implicitly and explicitly because we have been trusting banks for hundreds of years. Most of the time we’re pretty comfortable dealing with a bank but if you take that same data and say, now this data isn’t money, it is information about me or information about my preferences or other people we don’t actually have that same level of trust.

“Even if the governance processes, the security processes, the decision-making algorithms, are exactly the same we don’t have that same level of trust because we are not used to the idea of a government or a Google or a web services company delivering services to us in a way that we’ve interacted with for hundreds of years.”

For Oppermann, our data journey is similar to the journey from gold to paper, to bank notes and now to data. He says trust will ultimately build slowly, through reliable and expected performance.

Part of the trust problem exists around the fear of too much information being held by too few. The big data refineries such as Amazon, Alphabet, Facebook and Apple already have a monstrous first-mover advantage, as do many financial institutions. This has bred the fear that they are too large and in danger of becoming monopolistic, in the same way Standard Oil was in the United States in the early 20th century. Many have asked whether they need to be broken up or heavily regulated.

Professor Sander Klous, Partner in Charge of KPMG’s Data & Analytics practice in the Netherlands, says governments are trying to play catch-up, but it’s difficult because the old rules around ethics have been turned on their head.

“We know how to apply them to human activity but how do you apply ethics to an algorithm? It’s something completely new,” he says.

As to whether the big data refineries should be broken up, he indicates that data is the new element in antitrust considerations. It’s a winner takes all ecosystem, where it becomes impossible to outperform the largest players because of all the data they possess.

“It’s a bit like the big banks where governments wanted to exercise some control over them because they became too big to fail. It’s the same with Google or a Facebook, if either of them broke down tomorrow, you could claim they are probably too big to fail as well,” Klous says.

Klous does suggest the sheer size of the large data refineries will see them eventually broken up because they’re unsustainable.

“The data refineries are basically just really big pipes where raw material is processed and something smart comes out. So, the whole idea that one party is controlling that pipeline is too rigid to be a sustainable model.

“I think what you need is multiple parties that are able to work together in a platform like structure and the data refinery has to become more complex because there is more than one party in control.”

He draws an analogy with traffic lights, where one party is in control, as opposed to a roundabout, which is a simpler concept with more parties in control, as long as everyone abides by a simple rule.

“In a roundabout, there is a simple standard where right goes first, and then you make your decision to enter and everything flows. The same applies in a data environment, where there can be a simple set of standards that need to be complied with, and then as you add intelligence (or information) to the data refinery it informs decisions. You’re not relying on one single party to make the right decisions.”

He says data refineries will eventually turn into this platform model, where multiple parties collaborate to create value, and eventually domain-specific refineries will develop in areas such as health or logistics, where the dominant players will work together.

As for the future, Klous says we are inevitably moving towards a smart society but there are things we need to get under control. Ideally, he would like to see some sort of control framework in place that allows individuals to be able to trust what the large data refineries are actually doing.

“We are sorting out how to deal properly with privacy and other ethical issues without losing benefits like more convenience or increased efficiency.”

 

By Mark Eggleton

Sourced from Reports.afr

Researchers urged to hone methods for mining social-media data, or investment in marketing will be wasted.

By MediaStreet Staff Writers

A growing number of people, from marketers to academic researchers, are mining social media data to learn about both online and offline human behaviour. In recent years, studies have claimed the ability to predict everything from summer blockbusters to fluctuations in the stock market.

But mounting evidence of flaws in many of these studies points to a need for researchers to be wary of serious pitfalls that arise when working with huge social media data sets. This is according to computer scientists at McGill University and Carnegie Mellon University.

Such erroneous results can have huge implications for decisions based on data gleaned from social media: a lot of marketing investment could be placed in the wrong areas.

The challenges involved in using data mined from social media include:

  • Different social media platforms attract different users – Pinterest, for example, is dominated by females aged 25-34 – yet researchers rarely correct for the distorted picture these populations can produce.
  • Publicly available data feeds used in social media research don’t always provide an accurate representation of the platform’s overall data – and researchers are generally in the dark about when and how social media providers filter their data streams.
  • The design of social media platforms can dictate how users behave and, therefore, what behaviour can be measured. For instance, on Facebook the absence of a “dislike” button makes negative responses to content harder to detect than positive “likes.”
  • Large numbers of spammers and bots, which masquerade as normal users on social media, get mistakenly incorporated into many measurements and predictions of human behaviour.
  • Researchers often report results for groups of easy-to-classify users, topics, and events, making new methods seem more accurate than they actually are. For instance, efforts to infer political orientation of Twitter users achieve barely 65% accuracy for typical users – even though studies (focusing on politically active users) have claimed 90% accuracy.

Many of these problems have well-known solutions from other fields such as epidemiology, statistics, and machine learning. The common thread in all these issues is the need for researchers to be more acutely aware of what they’re actually analysing when working with social media data.

Social scientists have honed their techniques and standards to deal with this sort of challenge before. Says Derek Ruths, an assistant professor in McGill’s School of Computer Science, “The infamous ‘Dewey Defeats Truman’ headline of 1948 stemmed from telephone surveys that under-sampled Truman supporters in the general population. Rather than permanently discrediting the practice of polling, that glaring error led to today’s more sophisticated techniques, higher standards, and more accurate polls. Now, we’re poised at a similar technological inflection point. By tackling the issues we face, we’ll be able to realise the tremendous potential for good promised by social media-based research.”

 

Advertising is strangling the web, and that may have to do with the declining value and lack of transparency associated with the players that dominate it. (Image: Anders Emil Møller / Trouble).

For Big on Data.

The problem with advertising data and what to do about it. Plus, the future of big data architecture, and other stories from the Ad Tech trenches.

The greatest minds of this generation are wasted on advertising. Or at least, that’s what someone who has been there and done that thinks. Like most successful aphorisms, this one raises eyebrows, drives heated discussions, and strikes a point or two.

Marketing and advertising have enormous influence on society at large — business, technology, media, culture, and data. So talking with people working at the intersection of those fields can offer some insight into the state of the union of Big Data and Ad Tech.

Advertising is big, and so is its data

Advertising is a multi-billion dollar business that has been going through the process of digital transformation for a couple of decades already. Some of today’s most advanced, powerful, and influential companies have advertising embedded in their core.

A good part of the innovation that has been driving big data has come about as a response to the needs of advertising at scale before getting a life of its own. MapReduce, for example, the blueprint for Hadoop’s first incarnation, was originally developed and deployed at scale at Google.

But although Facebook and Google, the ‘Big Two,’ are by far the biggest players in the digital advertising space, they are not the only ones. The Ad Tech, or Marketing Tech, scene is booming, and programmatic marketing is taking over quickly.

Mike Driscoll, CEO of Metamarkets, points out that marketing is being digitally transformed and marketers are following suit. Metamarkets is part of the Ad Tech wave, and its core business is to provide marketers with insights on their digital presence.

“The future will be digital,” Driscoll says. “CMOs (Chief Marketing Officers) are turning to CMTOs (Chief Marketing Technical Officers). But as marketers are going digital, they are also starting to have less trust in some of the channels they’re buying from. Investing in technology means they are now able to hold their partners accountable.”

The problem with advertising data

Driscoll has more than anecdotal evidence and opinions here. Metamarkets just published a survey called the Transparency Opportunity, in which it attempts, as it says, to quantify the benefits of trust.

The findings of the survey show that almost half of the brands using programmatic media buying believe lack of transparency is inhibiting its future growth and scale. But what do we talk about when we talk about transparency here?

“All marketers work with the established duopoly — Google and Facebook,” says Driscoll. Metamarkets also works with other platforms, such as Twitter and AOL, but it’s the Big Two that dominate the advertising market. While the market is growing, nearly all of that growth is driven by Google and Facebook.

The advertising pie may be growing, but the growth is driven by a duopoly. (Image: Jason Kint)

It’s not hard to see where this is going: Google and Facebook dominating the market and dictating their terms. This presents a problem for all parties involved. For media, dependence on advertising translates as dependence on the Big Two. Media are trying to find ways to cope and come up with new models of doing business while maintaining their editorial independence.

For consumers, the ever-increasing volume of advertising means they are constantly bombarded by a barrage of ads. They are told that this is the price they have to pay for having access to free content, and to a certain extent, it is true. The problem is that more and more advertising is strangling content, consumer fatigue is taking over, and the value and effectiveness of advertising is dropping.

Obviously, this presents a problem for advertisers and their clients. “Marketers want more transparency. They would like to get a receipt for what they buy, instead of a PowerPoint and a good story,” Driscoll says. “Brands are asking their partners to provide better analytics. Historically, channels have been providing results, but not analytics on the results.

Digital transformation means that we don’t just buy goods anymore, we also buy data about the goods. Take AWS, for example: When they started, they just provided the service, but by now, you also get analytics to go with it. Major channels need to invest more not just in internal technology, but also in providing better data access to their partners.”

The big guys will just not share

But, seriously, this is Google and Facebook we’re talking about. Are we to believe that the most iconic data-driven organizations in the world can’t make the right data available to marketers? “Ask any marketer and they’ll tell you — the big guys will just not share. It’s not in their interest to be transparent, but rather to be as opaque as possible,” Driscoll says.

Almost half of the brands surveyed see lack of transparency as a problem. (Image: Metamarkets)

So, what can be done to deal with this? The real power of marketers is the power to check them, according to Driscoll. “If you look at the leaders emerging in Fortune 500, the next generation of marketers are technologists, and they are demanding independent audits and data. Consider this:

For a long time, advertisers wanted to know if their ads were viewed or not. Facebook said, ‘OK, we’ll measure it ourselves.’ And they got away with it for a while. They reported their own viewability stats, just like NBC used to report how many people viewed their own shows.

In the last year, that has changed. Marketers said, ‘We will not send advertisements to Facebook unless we have an independent source of truth.’ So, Facebook responded by providing access to their data to a company called Moat, which built a business model around auditing Facebook data.

That does not mean Facebook and Co. will give away all of their data — you also have to consider privacy issues here. But when you talk to brands, even though they will not say so in public, they are actually doing that. When you have budgets in the tens of millions, you can do that — pull data out and do what every marketer would like to do: build a unified view over their channels.”

Analyst super powers

But what about the rest of the world, the ones that don’t have the budget to cope with this? Perhaps regulation would be needed, so if the big guys won’t share, someone should make them?

“We’ve been hearing rumours about Congress getting involved, but for most businesses that would be the last resort. It’s not the ideal solution for marketers or media companies, especially considering the all-time-low approval ratings in the US right now,” Driscoll says. For him, the answer is in marketers investing more in analytics.

At one end of the continuum, organizations can do it all themselves, using infrastructure like Hadoop and the analytics tools that sit on top of it to help them collect and analyze the data they need. At the other end, Metamarkets touts itself as the right solution for marketers.

Metamarkets is a domain-specific solution that builds on four pillars: Fast data exploration, intuitive visualization, collaboration, and intelligence. Driscoll elaborates: “Scale is a requirement, and we are quickly moving towards streaming events and data.

Interactive visualization helps you understand what’s going on. You need more than dashboards. Dashboards may update, but the questions they answer stay the same. You need collaboration — like Slack for data, that helps teams communicate and share methods and insights.

And you need intelligence. In analytics, you spend 80 percent of your time preparing data and 20 percent actually doing analysis. We have ETL connectors for a multitude of platforms that help get the data where you need them. Plus, it’s one thing to show data, and another thing to search for insights.”

Metamarkets tries to look at what analysts do and automate that to suggest root causes. For example, a campaign running behind targets is something that can be monitored using metrics. But to get to the reason why this is happening, an analyst would slice and dice data per region or demographics.

Metamarkets says it can automate this process and suggest root causes, evolving from tracking statistically significant signals to deriving business-focused insights. “We let analysts specify metrics they are interested in, and then perform root cause analysis for them. We believe in machine and human working side by side, not in replacing analysts, but in giving them superpowers,” Driscoll says.
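The manual "slice and dice per region or demographics" step described above can be pictured as a simple grouped comparison: which segments deviate most from the campaign-wide metric. The pandas sketch below is a generic illustration with invented column names, not Metamarkets' actual system.

```python
# Generic illustration of manual root-cause slicing: compare each segment's
# conversion rate against the campaign-wide rate and rank the biggest gaps.
# Column names (region, age_band, impressions, conversions) are invented.
import pandas as pd

def biggest_deviations(events: pd.DataFrame, dimension: str, top_n: int = 5) -> pd.DataFrame:
    overall_rate = events["conversions"].sum() / events["impressions"].sum()
    by_segment = (events.groupby(dimension)[["conversions", "impressions"]].sum()
                        .assign(rate=lambda d: d["conversions"] / d["impressions"],
                                lift=lambda d: d["rate"] - overall_rate))
    # Rank segments by how far they sit from the overall rate, in either direction.
    order = by_segment["lift"].abs().sort_values(ascending=False).index
    return by_segment.reindex(order).head(top_n)

# Usage: biggest_deviations(campaign_events, "region") or biggest_deviations(campaign_events, "age_band")
```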

Data at advertising scale and the future of pipelines

As Metamarkets has been at the forefront of data at advertising scale, with Driscoll serving as its CTO, he shared some insights on the evolution of big data architecture: “We have been pushing the limits of scale, so we encounter problems before others do,” he says.

This has resulted in Metamarkets developing and releasing Druid, an open-source distributed column store. “We created Druid because we needed it and it did not exist, so we had to build it. And then we open sourced it, because if we had not, something else would have come along and replaced it.”
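To give a flavor of what querying Druid looks like, here is a minimal sketch that sends a Druid SQL query to a broker over HTTP. The broker address, datasource, and column names are assumptions for illustration and have nothing to do with Metamarkets’ actual deployment.

```python
import requests

# Post a Druid SQL query to the broker's SQL endpoint and print the rows.
DRUID_SQL = "http://localhost:8082/druid/v2/sql/"  # assumed broker address

QUERY = """
SELECT TIME_FLOOR(__time, 'PT1H') AS hr,
       campaign_id,
       COUNT(*) AS impressions
FROM ad_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY TIME_FLOOR(__time, 'PT1H'), campaign_id
ORDER BY TIME_FLOOR(__time, 'PT1H')
"""

response = requests.post(DRUID_SQL, json={"query": QUERY})
response.raise_for_status()
for row in response.json():
    print(row["hr"], row["campaign_id"], row["impressions"])
```

Druid also accepts native JSON queries against the same broker; the SQL endpoint is simply the more approachable entry point.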

Metamarkets has been evolving its data pipeline, but has not yet moved to a Kappa architecture because its clients are not ready for it. (Image: Metamarkets)

Druid is seeing some traction in the industry. Case in point: when Hortonworks engineers recently presented their work on the combined use of Hive and Druid at the DataWorks EMEA Summit, they attracted widespread interest. Could this mean there is a valid case for building a business around Druid?

“We have the largest deployment in production, and we love being part of the community. Druid is used by the likes of Airbnb and Alibaba. But we have no plans of building a business around it. We don’t believe the future is around data infrastructure, which is becoming a commodity, and we don’t want to be competing against the Googles of the world there.

Sure, this may be working for companies built around Hadoop, but commercialization of open source needs widespread adoption to succeed. But I can tell you that Cloudera and Hortonworks are looking to add Druid to their stack and to the range of services they offer.”


Driscoll does not believe in horizontally expanding Metamarkets, even though its experience in building data pipelines at scale could in theory be applied to other domains beyond advertising. Its own pipeline has been evolving, going from Hadoop to Spark and from Storm to Samza.

“Spark is more mature and it meets our needs at this point, and we feel much the same way about Samza,” he says. “But we see streaming as the future of our pipeline. When you work with streaming, there’s a sort of CAP theorem equivalent that applies there.

In distributed data stores, you have consistency, availability, and partition tolerance, and your system can only support two of those simultaneously. In streaming data, you have accuracy, velocity, and volume, and again your system can only support two of those simultaneously.

This is why we think the model supported by Apache Beam, Google Dataflow, and Apache Flink will be key going forward. When streaming at scale, there’s no such thing as objective truth, so you have to rely on statistical approximation and on watermarks.

Do we see our current Lambda architecture giving way to a flattened Kappa architecture? When you work on the bleeding edge of real-time architecture, what matters for organizations like Metamarkets, which are in the business of integrating data from other sources, is whether those sources can keep up.

But when it comes to other companies, not many are yet at the point where they can stream data out. Only the most sophisticated, agile companies out there are able to do this. At this point, only about 50 percent of our clients are there.”
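To make the event-time idea concrete, here is a minimal sketch using the Apache Beam Python SDK, one of the frameworks Driscoll mentions. The sample events and the 60-second windows are illustrative assumptions; in production the input would arrive from a streaming source and the runner’s watermark would decide when each window is complete enough to emit.

```python
import apache_beam as beam
from apache_beam.transforms import window

# Illustrative (event_type, unix_timestamp) pairs standing in for a
# continuous stream from a source such as Kafka or Pub/Sub.
EVENTS = [
    ("ad_click", 1494000000),
    ("ad_click", 1494000030),
    ("ad_view", 1494000075),
    ("ad_click", 1494000090),
]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(EVENTS)
        # Attach event-time timestamps so windowing reflects when events
        # happened, not when they were processed.
        | "Stamp" >> beam.Map(lambda e: window.TimestampedValue(e[0], e[1]))
        # Fixed 60-second event-time windows.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "PairOne" >> beam.Map(lambda name: (name, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

On a real streaming runner, events arriving after the watermark has passed a window would be dropped or, with the right trigger configuration, emitted as updated results, which is where the accuracy-versus-velocity trade-off Driscoll describes shows up.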

By for Big on Data

Sourced from ZDNet

Sourced from CRAIN’S New York Business

Time Inc. is planning to sell some magazines or other properties as the struggling publisher tries to push ahead with a digital strategy and move past months of talks with potential acquirers.

The owner of Sports Illustrated and People will look to offload “relatively smaller” titles in its portfolio and other “non-core” assets, chief executive Rich Battista said Wednesday on a conference call. He didn’t name the assets.

Battista added that Time is open to joint ventures with other companies and interested in an outside investor who could provide capital “for a particular opportunity.”

Last month, Time announced that it was sticking with its online strategy rather than sell itself after months of negotiations with potential suitors, including Meredith Corp. and a group including Pamplona Capital Management and Jahm Najafi. New York-based Time was said to be holding out for more than $20 a share.

The shares slumped as much as 19% to $12.20 on Wednesday. The magazine publisher reported first-quarter revenue of $636 million, missing the $642 million average of analysts’ estimates. Its net loss widened as print advertising sales declined 21%. It also cut its dividend. Like other magazine publishers, Time is struggling to transform itself as print advertising dries up and the lion’s share of digital advertising dollars goes to Facebook Inc. and Google.

The magazine owner has spent months restructuring its business and replacing senior management, hoping to persuade advertisers to pour money into its magazine titles. This fall, Time plans to introduce a Sports Illustrated online video service with documentaries and insights from the magazine’s reporters, part of its growing push into video. Some of Time’s smaller titles include Sunset magazine and What’s On TV, which is based in the U.K.

Investor challenge

On the earnings conference call, one of Time’s investors demanded more detail about the company’s strategic plan.

“You constantly refer to this strategic plan, but you provide no numbers for the shareholders to basically grasp what this company will look like in two or three years,” said Leon Cooperman, of Omega Advisors Inc., which owns 3.9% of the magazine publisher, according to data compiled by Bloomberg.

“I think it’s incumbent upon the company to share with the shareholders, the people that have the money invested, what the strategic plan would yield,” Cooperman said. “Because I’m pretty confident that this company can be sold today at at least $18 a share.”

Cooperman urged the company to hold an analyst day and reveal its strategic plan in more detail. “Then we can make an intelligent decision whether we should agitate for a sale or be patient and give you guys a chance to do your magic,” he said.

Battista replied that Time hired an adviser to cut costs and believes it can reach $1 billion in digital revenue, but did not provide a timeline. In an interview, Battista said the company would provide some profit guidance going forward and “other insights when appropriate.”

“We feel really excited and confident in our plan,” Battista said.

Sourced from CRAIN’S New York Business

By Tobi Elkin.

A new report by ad-tech provider BlueVenn finds that 72% of marketers consider data analysis more important than social media skills.

The report, “Customer Data: The Monster Under the Bed?,” incorporates research from 200 U.S. and U.K. marketers, with the goal of identifying the attributes most needed to compete in the data-centric marketing landscape.

Key findings include:

–Data management is now considered more vital than social media (65%), Web development (31%), graphic design (23%), and search engine optimization (13%).

–However, 27% of marketers are still handing over the process of data analysis to IT departments.

–The focus on understanding and synthesizing customer data is especially strong at large enterprises, where four out of five marketers consider data analysis to be a “vital” skill.

–Data segmentation and modeling are also considered highly sought-after marketing skills, ranking higher than both Web development and graphic design within the enterprise space.

“In the age of big data, marketers have a better opportunity than ever before to truly understand their customers’ decision-making processes. Unfortunately, as it stands, most marketers simply don’t have the time, the knowledge or the tools necessary to undertake this task in a practical and effective way,” stated Anthony Botibol, marketing director at BlueVenn.

By Tobi Elkin.

Sourced from MediaPost