
De-Duplication

Since I worked on de-duplication as part of my undergraduate thesis, I found the articles from NetApp and EMC very interesting.

De-duplication is the process of removing replicas of a file; the file may be of any type (for example .jpg, .txt, .doc, etc.). There are two major approaches to data de-duplication: one is file level and the other is block level.

A few companies employ file-level de-duplication, while the majority employ block-level de-duplication.

Block Level De-duplication:

De-duplication at the data block level compares blocks of data with other blocks, which allows you to de-duplicate data within a given object. If an object (file, database, etc.) contains blocks of data that are identical to each other, block-level de-duplication avoids storing the redundant blocks and reduces the size of the object in storage.
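A rough way to see the idea on the command line (just a sketch assuming GNU coreutils; bigfile.bin is a placeholder name): split a file into fixed 4 KB blocks, hash each block, and look for hashes that repeat. The repeated blocks are exactly what a block-level de-duplicating store would keep only once.

split -b 4096 bigfile.bin /tmp/blk_          # cut the file into 4 KB blocks
sha1sum /tmp/blk_* | sort | uniq -w 40 -d    # show block hashes that occur more than once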

In de-duplication a single copy of the file is maintained and the other copies are turned into references to that particular file, which drastically reduces storage use when the redundancy for that file is very high. The references are similar to the soft-link approach used in Linux.

For example, if an image file of 3 MB is stored in 5 different locations, the total space occupied by that file is 15 MB. With data de-duplication a single copy of the file is maintained and the rest of the copies become references to the original file location, so the space used after de-duplication is far less than before: perhaps only slightly more than 3 MB.

Files may be named differently, which poses a challenge; hence the MD5/SHA-1 hash of each file's contents is calculated and checked for duplicates, and links are established between identical files. For my project I use Amazon S3 for storing data on the cloud. I found it to be an easy and efficient way of storing and accessing my data. Amazon AWS provides SDKs for various languages like C#, Java and PHP. The how-tos are provided under the Developer section of the Amazon AWS website.
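As a minimal sketch of the hashing idea (assuming bash 4 and GNU coreutils; it uses hard links rather than soft links, and is not production code), the following script hashes every file under a directory and replaces later identical copies with links to the first one:

#!/bin/bash
# Sketch: replace duplicate files under a directory with hard links
# to the first copy found, using the SHA-1 of the contents as the key.
dir=${1:-.}
declare -A seen                          # hash -> path of first copy
while IFS= read -r -d '' f; do
    h=$(sha1sum "$f" | awk '{print $1}')
    if [[ -n ${seen[$h]} ]]; then
        ln -f "${seen[$h]}" "$f"         # duplicate: link it to the original
    else
        seen[$h]=$f                      # first copy with this content
    fi
done < <(find "$dir" -type f -print0)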

The links given below provide some useful resources regarding de-duplication.

http://www.informationweek.com/blog/229205878

http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/134-inline-or-post-process.html/

http://www.evaluatorgroup.com/document/data-de-duplication-%E2%80%93why-when-where-and-how-infostor-article-by-russ-fellows/

And of course, Wikipedia:

http://en.wikipedia.org/wiki/Data_deduplication

With a lot of research being carried out on how to decrease storage costs, de-duplication proves to be an effective tool in this regard.

It's very annoying when a Linux box warns you that you are the super user and that you may harm your computer by doing this, blah blah blah. I was just trying to install Google Chrome on my Fedora 16. I got the RPM from the Google website, but unfortunately it didn't open; the reason: "cannot run as root". To overcome this there's a small workaround.

1. Go to terminal and type

xhost +

The above command disables access control for the X11 display.
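Once you are done, you can restore access control with

xhost -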

2. Next, open the Google Chrome launcher script located at /usr/bin/google-chrome and add "--user-data-dir" to the end of the last line.

So the end of your launcher script will look something like this.

export LD_LIBRARY_PATH

export CHROME_VERSION_EXTRA="stable"

# We don't want bug-buddy intercepting our crashes. http://crbug.com/24120
export GNOME_DISABLE_CRASH_DIALOG=SET_BY_GOOGLE_CHROME

exec -a "$0" "$HERE/chrome" "$@" --user-data-dir

3. Save the file and quit; you should be up and running.

HTH


Yum

Yum is a package management utility that is used to install packages. The greatest benefit of using yum is that it automatically resolves the dependencies required for an installation and installs them as well. YUM stands for Yellowdog Updater, Modified. It is used in RPM-compatible operating systems like Red Hat, Fedora and CentOS. It makes use of XML for storing repository information.

Configuring yum

Yum repository definitions are stored under /etc, in yum.repos.d.

The basic syntax for a yum repo file is

[repository-id]

name=Human-readable name of the repository

baseurl=ftp://mywebsite.com

enabled=0 (or) 1

gpgcheck=0 (or) 1

name is a human-readable label for the repository, and baseurl defines the URL from which the packages are retrieved. enabled says whether the repo is on (1) or off (0). gpgcheck can likewise be 0 or 1, depending on whether a GPG signature check needs to be performed on the packages or not.

The baseurl may be defined for HTTP/FTP sites and a few more I assume. It can be configured to use a local file system as well: just use file:///(path).

For example: baseurl=file:///root
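As a fuller sketch (the directory and repository id below are made up for illustration), a local repository can be created with the createrepo tool and wired into yum like this:

mkdir -p /srv/localrepo                      # directory that will hold the RPMs
cp *.rpm /srv/localrepo/
createrepo /srv/localrepo                    # generate the XML metadata yum reads

cat > /etc/yum.repos.d/local.repo <<'EOF'
[localrepo]
name=Local RPM repository
baseurl=file:///srv/localrepo
enabled=1
gpgcheck=0
EOF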

HTH

A few tips on configuring your yum repository.

1. First update your packages and repository metadata.

yum update

2. Make sure you clean yum's cache so that there isn't much of a mess.

yum clean all

3. Once the cache is cleaned, check the available packages.

yum list

The above command lists the various packages available in the configured repositories.

Make sure to do the above steps every time before installing packages.

HTH

Installing Broadcom drivers on Fedora 16 is quite easy, but sometimes configuring the repositories proves to be a hindrance. First, to install the Broadcom drivers, make sure you add the RPM Fusion repositories to your yum configuration.

1. Go to rpmfusion.org/configuration and select the desired RPM from the list. In my case I chose RPM Fusion Non-Free for Fedora 14, 15 and 16.

2. Run the rpm file.
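From the terminal this is just (the filename is a placeholder for whatever you downloaded):

rpm -ivh rpmfusion-nonfree-release.noarch.rpm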

3. Make sure your Broadcom driver packages are present.

yum install b43*

4. Now install kmod-wl

yum install kmod-wl

5. Restart your machine

6. Make sure your kernel header files are up to date.

yum install kernel-PAE-devel kernel-headers

HTH

The Internet has become an everyday reality: people can now talk to their friends and relatives through video chat, and things like these seemed impossible a few decades ago. The Internet has greatly revolutionized the world. From payments to social networking, it has its impact on most individuals. Without the Internet, life is hard.

The Internet consists of an infrastructure laid down by hardware devices like cables, routers, switches (and, earlier, hubs), transmission towers, satellites, etc. These form the backbone of the Internet.

The various components include nodes, clients and servers. Some are end points: the computer, smartphone or other device you're using to read this may count as one. We call those end points clients. Machines that store the information we seek on the Internet are servers. Other elements are nodes, which serve as connecting points along a route of traffic. Connections can be physical or virtual. Moreover, we can categorize the Internet as wired and wireless as well.

Now come the software components. Protocols are the sets of rules that nodes and machines in a network follow; without protocols, communication is nearly impossible. They lay down the standards and policies that the nodes in the network must follow.

Commonly used protocols on the Internet include TCP, UDP, IP, HTTP and FTP.

Now let's concentrate on how packets flow across the Internet.

First, a connection to the Internet is established, and we make use of a web browser for viewing web pages. Your computer sends an electronic request over your connection to your Internet service provider (ISP); the ISP is the Internet provider, for example Verizon, Airtel or BSNL. The ISP routes the request to a server further up the chain on the Internet. Eventually, the request will hit a domain name server (DNS).

The DNS forms an important part of the Internet: it maps human-readable domain names to IP addresses, and it is used in redirection lookups apart from many other tasks. This server will search for a match for the domain name you've typed in, for example www.google.com. If a match is found, it redirects to the corresponding IP address; for example, http://www.google.com might resolve to 216.239.51.99. If it doesn't find a match, it will send the request further up the chain to a server that has more information.
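You can watch this lookup happen yourself with the standard dig tool (the addresses returned will differ from the example above):

dig www.google.com A +short    # just the resolved IP address(es)
dig www.google.com +trace      # walk the chain of name servers from the root down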

The request will finally come to our very own web server. The Internet makes use of packets: data is divided into several small packets that are transmitted and received over the Internet. Each protocol defines the header and footer formats that surround the information each packet carries, and the routing protocol is specified as well. Hence, depending on the protocol and the addresses, the packets reach the destination node.

That’s an important feature. Because packets can travel multiple paths to get to their destination, it’s possible for information to route around congested areas on the Internet. In fact, as long as some connections remain, entire sections of the Internet could go down and information could still travel from one section to another — though it might take longer than normal.

Routing is essential because there are several paths packets can take over the Internet, and it's important to follow the best path and provide alternate paths when necessary.
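You can see one such path for yourself with traceroute (the hops will depend on your own connection):

traceroute www.google.com    # list every router the packets pass through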

HTH


I was always amazed at the way torrents work, which is why I felt I must write an article about torrents on my blog. Torrents are the most widely used mechanism for downloading large files over the Internet. Even though they are an innovative technology, they are also responsible for most of the piracy that happens over the Internet.

Now let's get to the point: "How do torrents work?"

Torrents come under the category called peer-to-peer sharing. P2P file sharing is different from regular downloading: in peer-to-peer sharing, we use a software program to find and connect to computers that have the file you want to download. Because these are ordinary computers like yours, as opposed to servers, they are called peers.

A few definitions:

  • “Swarming” is about splitting large files into smaller “bits”, and then sharing those bits across a “swarm” of dozens of linked users.
  • “Tracking” is when specific servers help swarm users find each other.
  • Swarm members use special Torrent client software to upload, download, and reconstruct the many file bits into complete usable files.
  • Special .torrent metadata files act as pointers during this whole process, helping users find other users to swarm with, and enforcing quality control on all shared files (see the example below).
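If you have the Transmission command-line tools installed, you can peek inside a .torrent file and see this metadata for yourself; the file name below is just a placeholder:

transmission-show myfile.torrent    # prints tracker URLs, piece size and file list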

A torrent makes use of primarily two concepts: "seeds" and "peers".

Every torrent client contacts a tracker to find other computers running a torrent client. Those that have the complete file are called "seeds", and those with only a portion of the file downloaded are called "peers".

The tracker keeps track of the swarm, i.e. it identifies which computers are seeds and which are peers.

Torrents make use of simultaneous upload and download, i.e. a torrent client downloads parts of the file and at the same time uploads the parts it already has for other peers in the network to use. The upload and download rates can be specified in the torrent client.

Download speed is also influenced by how much you share with the swarm. I have come across a few articles saying that most torrent clients make use of a strategy called "tit for tat", which means the more you upload, the better your download speed. I'm from India and I have never seen my download speeds cross the 250 Kbit/s mark, so I have little to comment about it.
A quote from netforbeginners.about.com:
If you share, tracker servers will reward you by increasing your alotted swarm bandwidth . Similarly, if you leech and limit your upload sharing, tracking servers will choke your download speeds, sometimes to as slow as 1 kilobit per second. Indeed, the “Pay It Forward” philosophy is digitally enforced! Leeches are not welcome in a bittorrent swarm.

For further reading: http://en.wikipedia.org/wiki/BitTorrent_(protocol)

With the increase in traffic, it gets difficult to provide high-speed access to resources like data and network bandwidth. Companies are striving to provide high-speed access to their clients and customers, but the growing traffic proves to be a hindrance. Content Delivery Networks (CDNs) provide a solution to this: by placing edge servers at various locations around the globe, companies can provide high-speed access by directing the users in a given region to the servers closest to them.

A quote from Wikipedia:

The capacity sum of strategically placed servers can be higher than the network backbone capacity. This can result in an impressive increase in the number of concurrent users. For instance, when there is a 10 Gbit/s network backbone and 200 Gbit/s central server capacity, only 10 Gbit/s can be delivered. But when 10 servers are moved to 10 edge locations, total capacity can be 10×10 Gbit/s.

CDNs are dynamic in nature and serve content over TCP and UDP. CDN technologies place a lot of importance on delivering resources dynamically, which also plays a major role when a particular server fails: the CDN can provide high availability by falling back to other edge servers, so there is little lag in the transmission of data.

Some of the popular CDNs include Akamai, Amazon's CloudFront and CloudFlare.
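You can get a feel for this yourself: resolving a CDN-hosted name from different networks returns different edge servers (the Akamai hostname below is just a commonly cited example, and the output will vary with your location):

dig a248.e.akamai.net +short    # IPs of edge servers near you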

It's very handy to use Hybridfox for handling images in Eucalyptus; this article demonstrates how to manage instances on the cloud using the command line.

A key pair needs to be created before logging into a virtual machine on the cloud:

euca-add-keypair mykey > mykey.private

Start the VM; here -n is the number of instances that you want to start and emi is the image that you want to run on the cloud:

euca-run-instances -k mykey -n (number of instances) (emi)

To query the system for your instances and their status, use

euca-describe-instances
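When you are done with an instance, it can be shut down again (the instance id is a placeholder taken from the euca-describe-instances output):

euca-terminate-instances i-XXXXXXXX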

It is a common issue that some download mirrors may be down or hit by heavy traffic; this method suggests a way to continue a download you had paused from an alternate mirror (or URL). I use Internet Download Manager (IDM), and it's one of the best in the business. To continue your paused download from a different mirror, get the address of an alternate mirror from the source (AFAIK most websites these days provide links to alternate mirrors, especially for massive files). Go to IDM, right-click the file that was paused, click Properties and change the address to the new mirror URL; now you can resume your download at a higher speed. Choose a mirror that is relatively close to you for better speeds.

HTH