Data Scraping Services

Not all bots are bad

August 20, 2018 by Tianna

Image found via PCWorld

Internet bots. We’ve all heard of them. Everyone recognises that they exist among us on the internet, stealing concert tickets, influencing political elections. But do we really understand what these bots actually are?

Essentially, ‘bots’ are internet robots: software instructed to creep across the internet and run automated tasks. Normally these bots manage tasks that are simple and repetitive, and they carry them out at a much larger capacity and speed than any human could.

The main use of bots comes in the form of web spiders, which are automated scripts sent to crawl out across the internet and collect, analyse and store information that might be useful to data miners.

In fact, bots are now responsible for 52% of all internet traffic, according to a survey from security firm Radware. Not all of these bots are alike, however: there are good bots, bad bots and neutral bots.

‘Good bots’ can provide beneficial services to both companies and customers, through means such as virtual assistants, search engine indexing, and monitoring website performance. These can include search engine bots, commercial bots, feed fetchers and monitoring bots. Chat bots increase engagement with websites, as they allow customers to communicate with companies without the need for a person, consequently allowing for 24/7, unlimited communication. Bots also allow marketers to optimise their sites for search engines and ensure competitive pricing.

‘Bad bots’, by contrast, are designed with malicious intent, performing automated attacks on websites and computers that cause damage to their victims. These types of bots are used by hackers, cybercriminals and fraudsters to commit click fraud, spread viruses and worms, take over accounts, launch DDoS attacks, steal data (by hacking private information), and so on.

For example, there are notorious bots that buy up all the good seats for concerts only to sell them at higher prices. Some bots are used to increase views on YouTube videos, or to generate more followers on social media platforms. They are even used in online multiplayer games to harvest resources that would normally take lots of time and effort to gain, severely damaging the economy of online gaming. Spambots are used to spam lots of content onto websites, usually in the form of advertisements, often causing poor website performance.

According to the annual Imperva Incapsula Bot Traffic Report, 94.2% of websites have experienced a bot attack, while Radware has found that 45% of companies surveyed have experienced a data breach in the last year.

‘Neutral bots’ are inoffensive bots that search for public information without any interference on the websites they scrape. They only harvest data that is made public, so it’s already accessible by everyone anyway and doesn’t violate anyone’s privacy. These types of bots are usually run by responsible, experienced data miners who know how to politely extract any required data without having any effect on anyone or breaching any laws.

These bots can be used by organisations to gain data that they can use to improve their company. They are even used to gather data to create maps, such as with Google Maps, to ensure accurate depictions of where all the places you might want to go are located. Neutral bots extract data without any effect on anyone, and are therefore harmless.
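As a rough illustration of what "politely" extracting data can mean in practice, the sketch below checks a site's robots.txt rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The bot name, the robots.txt content and the URLs are all hypothetical, and the rules are inlined so the example runs without touching any real website. It is one possible approach, not a description of how any particular data miner works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot name, used only for this illustration.
USER_AGENT = "example-neutral-bot"

# Hypothetical robots.txt content, inlined so no network access is needed.
EXAMPLE_ROBOTS_TXT = """\
User-agent: example-neutral-bot
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True if the site's rules allow our bot to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# A public page is allowed; a page under /private/ is not.
print(may_fetch(parser, "https://example.com/products/"))
print(may_fetch(parser, "https://example.com/private/accounts"))

# The delay (in seconds) the site asks our bot to leave between requests.
print(parser.crawl_delay(USER_AGENT))
```

A bot built this way would skip any URL for which `may_fetch` returns `False` and would wait at least the requested crawl delay between requests.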

We at the Data Group focus on ‘good’ and ‘neutral’ bots. We use bots in the form of web spiders to gather, analyse and store information deemed valuable to our clients. This information can be vital to their projects, such as looking at competitive pricing, or to predict upcoming changes in the market.

Obviously we’re a little biased, but we really believe that not all bots are bad. While some of them are problematic, bots seem to be a big part of our future. With over half of the internet’s traffic coming from bots, it is expected that they – good, bad and neutral – will continue to have the biggest impact on marketing.

LinkedIn sued by data-scraping company with unexpected results

January 23, 2018 by Tianna

In mid-2017, LinkedIn was sued by HiQ Labs, an analytics start-up based in San Francisco.

Who are HiQ Labs?

HiQ Labs scrapes the data of thousands of employees via their public profiles on LinkedIn. It then analyses this data to provide reports for client companies. This allows those companies to monitor the status of their employees and how it may affect their business.

For example, this analysis can inform clients if certain employees have updated their LinkedIn profile in a manner that suggests they plan on resigning in the near future. According to HiQ Labs, this knowledge can be vital, as the loss of “talent” can put “everything they’ve been working on… at risk”.

Image found via LinkedIn

Legal spat between LinkedIn and HiQ

LinkedIn claims that HiQ Labs is breaching the US Computer Fraud and Abuse Act’s prohibition on unauthorised access to computer systems. The company told The Wall Street Journal that “if LinkedIn members knew that hiQ was accessing and collecting their data in this manner, many would not update their profiles”.

LinkedIn believes that HiQ Labs’ actions are “against the law”, and sent a cease-and-desist letter to the company, which responded by suing.

“We were stunned by LinkedIn’s actions,” Mark Weidick, CEO of HiQ, explains, “especially given their long-time familiarity with HiQ’s product and business. I run a company whose very existence is tied to the notion of public data really being equally accessible to all members of the public.”

Weidick goes on to clarify that “LinkedIn’s attempt to wall-off this public information—viewable by anyone with a web browser—is not just a danger to hiQ, but to any company that uses public sources to inform the services they provide.”

Bloomberg columnist Matt Levine comments on the peculiarity of this case, as LinkedIn has published this “exclusive data on a public website, and then accuse[s] the competitors of hacking for reading that website”.

“Data analytics on public information is a foundation stone of the modern internet,” HiQ’s lawyers maintain. “Without such technologies internet users would be unable to make sense of the billions of web pages that exist in this modern marketplace of ideas.”

LinkedIn argues that it has “no idea whether or not a bot may have ‘good’ intentions, or whether it is a malicious actor, such as a hacker seeking to take down the LinkedIn site, a spammer, or an identity thief”.

Unexpected Outcome

In the lawsuit between LinkedIn and HiQ, the judge has concluded that LinkedIn must allow HiQ to access profile data.

Judge Edward Chen has granted a temporary restraining order that prevents LinkedIn from continuing to block HiQ’s efforts to harvest its profile data. The judge reached this decision knowing that the block directly threatens hiQ’s business, as shown by evidence that the business was being impaired by it.
In the judge’s words, “The court concludes that based on the record presented, the balance of hardships tips sharply in hiQ’s favour.”

This data was obtained from public profile pages on LinkedIn, which undermines the argument that HiQ breached the Computer Fraud and Abuse Act by bypassing an authentication requirement.

The Court had previously explained that it was “doubtful that the Computer Fraud and Abuse Act may be invoked by LinkedIn to punish hiQ for accessing publicly available data”.
Regarding its plans to appeal, LinkedIn told Reuters, “This case is not over. We will continue to fight to protect our members’ ability to control the information they make available on LinkedIn.”

Many spectators believe that LinkedIn’s goal is not in fact to protect the privacy of their users, but rather to protect their company from the threat of competitors.

The irony is that, while LinkedIn claims that HiQ’s actions might cause its users to “not update their profiles”, HiQ would never have gained such publicity without this legal drama between the two companies, which has alerted even more members to such data scraping programs.

HiQ states that “This is a step in the right direction to ensure that any person or company looking to build a business on data analytics of public data may do so.” The company firmly believes that “public data must remain public.”

LinkedIn has since filed an appeal with the Ninth Circuit, asking it to vacate Judge Chen’s ruling ordering LinkedIn to stop blocking HiQ Labs from obtaining data from its users’ profiles.

Southwest sues data-scraping website

January 19, 2018 by Tianna

Southwest Airlines is suing SWMonkey.com, a website committed to helping Southwest travellers obtain the cheapest airfares.

The website was developed in November by Pavel Yurevich and his partner Chase Roberts. Both understood the toll of overpriced airfares, and developed a platform to help customers save money by purchasing tickets during drops in price.

The website used data scraping methods to alert customers to price drops on flights they had previously booked. SWMonkey.com would then charge $3 if the customer had saved at least $10 through the service.

The website SWMonkey.com is still live, but is not currently providing any services.
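The charging rule described above is simple enough to write down. The sketch below is only an illustration of that rule as the article states it: the function and constant names are made up, and nothing here reflects SWMonkey.com's actual code.

```python
from decimal import Decimal

SERVICE_FEE = Decimal("3.00")      # fee quoted in the article
MINIMUM_SAVING = Decimal("10.00")  # saving threshold quoted in the article


def fee_for_saving(original_fare: Decimal, new_fare: Decimal) -> Decimal:
    """Return the fee owed for a price drop (hypothetical helper).

    The fee applies only when the customer saved at least MINIMUM_SAVING.
    """
    saving = original_fare - new_fare
    if saving >= MINIMUM_SAVING:
        return SERVICE_FEE
    return Decimal("0.00")


# A $15 drop triggers the fee; a $5 drop does not.
print(fee_for_saving(Decimal("200.00"), Decimal("185.00")))
print(fee_for_saving(Decimal("200.00"), Decimal("195.00")))
```

`Decimal` is used instead of floating point so that currency comparisons at the threshold behave exactly.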

SWMonkey.com has received “a number of cease and desist letters demanding [they] shut down [their] website ‘immediately’”, and is now facing a lawsuit for computer fraud, violation of the terms and conditions of Southwest’s website, and violation of trademarks.

Southwest Airlines explains that this data scraping method has created “violations of our website terms and unauthorised use of our trademarks”, resulting in “substantial” and unnecessary traffic to their website.

Yurevich rebuts this statement, explaining to The Dallas Morning News that he feels this situation is “really kind of unfortunate” as they believe themselves not to be “a threat to their business in any way”.

Yurevich and Roberts believe that such services “should be” legal, despite the fact that “Southwest Airlines disagree”, explaining that they “don’t have the money to travel to Texas and challenge them.”

As of January 2018, the legitimacy of data scraping is still unresolved, as can be seen with recent test cases (e.g. LinkedIn vs. hiQ) that have not yet been decided.

Trouble cometh!

Whilst the cost-saving idea was good, it was bound to attract Southwest’s wrath:

  1. The site targeted Southwest’s profitability.  No big gorilla is going to sit idly by and let a little chimp take its food.
  2. The scraped data was visible on a public website.
  3. There was no value added to the data, either through conversion or by merging it with other sources.  It was obvious where the data originated.
  4. The website made it easy to track the owners of the site.
  5. It is not clear at this stage how the data harvest was undertaken.  Incorrectly configured crawlers can place a large load on sites, which prevents legitimate customers from transacting effectively and, in some cases, inadvertently creates a denial-of-service attack.
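On the last point, one common way to keep a crawler from overloading a site is to enforce a minimum gap between requests. The sketch below shows the idea with a small, hypothetical `Throttle` class. The class name, the interval and the page paths are assumptions made for illustration, and no real requests are sent.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive requests (illustrative only)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None

    def wait(self) -> float:
        """Pause until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            elapsed = now - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record when this request is actually released.
        self._last_request = now + delay
        return delay


# Hypothetical usage: at most one request every 0.05 seconds.
throttle = Throttle(min_interval=0.05)
for page in ("/fares/1", "/fares/2", "/fares/3"):
    throttle.wait()
    # A real crawler would fetch `page` here.
```

The clock and sleep functions are passed in as parameters so the behaviour can be checked without real waiting. In practice the interval would be chosen from the site's own guidance, such as a crawl delay in its robots.txt file.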

What could have been done?

Whilst we are not lawyers and remind readers that the following does not constitute advice, a better approach would have been to put the following in place BEFORE going live:

  1. Make sure the business is operating as a limited liability company.  This offers the directors and employees a degree of protection from being personally sued.
  2. Register the domain anonymously.
  3. Host the domain in an offshore data centre, in a country with low IP enforcement laws.
  4. Use web-based communication tools to retain anonymity.
  5. Use a professional data-scraper to harvest the data in an untraceable and discreet manner.

This approach would have made litigation difficult and expensive for Southwest, thus levelling the playing field.

The Data Scraping Group

Phone: 1300 788 662

info@data-group.com.au

6 Somerset Close, Heatherton
Victoria, 3202, Australia

ABN: 89095802985

Established: 2005

Copyright ©, The Data Scraping Group is a division of Net Assets (Australia) Pty Ltd.