Data Mining


  1. What kind of data mining can be performed on spatial databases?
    1. Statistical spatial data analysis, a popular approach to analyzing spatial data and exploring geographic information.
    2. Spatial data warehousing: a spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.
    3. Spatial data cube construction and spatial OLAP.

     

  2. What do you mean by discovery-driven exploration of data cube?

    In discovery-driven exploration of a data cube, anomalies in the data are automatically detected and marked for the user with visual cues.

    Precomputed measures that indicate data exceptions are used to guide the user in the data analysis process at all levels of aggregation.

     

  3. What is meant by loose coupling architecture?

    In a loose coupling architecture, the data mining system uses some facilities of a database (DB) or data warehouse (DW) system: it fetches data from a data repository managed by these systems, performs data mining, and then stores the mining results either in a file or in a designated place in a database or data warehouse.
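The fetch, mine, store cycle described above can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite database; the table names `purchases` and `frequent_items`, the sample rows, and the support threshold of 2 are all hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical repository managed by the database system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("a", "camera"), ("b", "camera"), ("a", "printer"),
     ("b", "printer"), ("c", "printer"), ("c", "phone")],
)

# 1. Fetch the data from the repository.
rows = conn.execute("SELECT item FROM purchases").fetchall()

# 2. Mine outside the database: count items and keep the frequent ones.
counts = Counter(item for (item,) in rows)
frequent = [(item, n) for item, n in counts.items() if n >= 2]

# 3. Store the mining results in a designated table.
conn.execute("CREATE TABLE frequent_items (item TEXT, support INTEGER)")
conn.executemany("INSERT INTO frequent_items VALUES (?, ?)", frequent)
conn.commit()

stored = conn.execute(
    "SELECT item, support FROM frequent_items ORDER BY item"
).fetchall()
print(stored)
```

The database is used only for storage and retrieval here; all of the mining logic runs in the application, which is what makes the coupling loose.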

     

  4. State the difference between classification and prediction.

    Classification: Predicts categorical class labels (discrete or nominal). It constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses that model to classify new data.

    Prediction: Models continuous-valued functions, i.e., it is used to predict missing or unavailable numerical data values rather than class labels.

 

  1. Define decision tree induction.

    Decision tree induction is the learning of decision trees from class-labeled training tuples.

    Decision trees are the basis of several commercial rule induction systems.

    The goal of decision tree induction is to induce, from a set of labeled examples, a classifier that can correctly predict the classes of unseen instances. The resulting set of rules is represented as a decision tree.
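As an illustration of inducing a tree from class-labeled tuples, here is a minimal sketch that picks the splitting attribute by information gain. The selection measure, the nested-dict tree layout, and the toy `rows`/`labels` data are assumptions made for this example, not something the answer above prescribes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning on attribute `attr`."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def induce_tree(rows, labels, attrs):
    """Return a nested dict; leaves are class labels."""
    if len(set(labels)) == 1:          # pure partition: make a leaf
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {"attribute": best, "branches": {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree["branches"][value] = induce_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining
        )
    return tree

def classify(tree, row):
    """Follow the tests from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree

# Hypothetical training tuples.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

tree = induce_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(classify(tree, {"outlook": "sunny", "windy": "yes"}))
```

The sketch does no pruning and raises a `KeyError` for attribute values that never appeared in training, so it is suitable only for showing the idea.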

     

  2. What is CLARA?

    CLARA (Clustering LARge Applications) is a sampling-based method for partitioning a large database, which deals with larger data sets than PAM.

    Instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a representative of the data. The complexity of each iteration in CLARA is O(ks^2 + k(n - k)), where s is the size of the sample, k is the number of clusters, and n is the total number of objects.
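The sampling idea can be sketched as follows: run a small k-medoids search on each sample, then keep the medoids that give the lowest cost over the full data set. This is a simplified illustration, assuming 1-D points with distinct values and absolute difference as the distance; the swap loop is a plain greedy variant rather than a tuned PAM implementation, and the sample count and seed are arbitrary.

```python
import random

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Greedy swap-based k-medoids on a small list of points."""
    medoids = rng.sample(points, k)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                # Accept a swap only if it strictly lowers the cost.
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return medoids

def clara(points, k, sample_size, num_samples=5, seed=0):
    """Cluster several random samples; keep the best medoids for all points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)   # evaluated on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return sorted(best_medoids), best_cost

# Two well-separated hypothetical groups of points.
data = list(range(0, 10)) + list(range(100, 110))
medoids, cost = clara(data, k=2, sample_size=8)
print(medoids, cost)
```

Because the medoids are searched only within each sample, the result depends on how representative the samples are, which is the trade-off CLARA accepts in exchange for handling larger data sets.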

     

  3. What do you mean by interval-scaled variable?

    Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.

  4. How can we make K-means algorithm more scalable?

    A recent approach to scaling the k-means algorithm is based on the idea of identifying three kinds of regions in data: regions that are compressible, regions that must be maintained in main memory, and regions that are discardable.

    An alternative approach to scaling the k-means algorithm explores the microclustering idea, which first groups nearby objects into “microclusters” and then performs k-means clustering on the microclusters.
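The microclustering idea can be illustrated with a small sketch: nearby points are first summarized as (centroid, count) pairs, and k-means then runs on those summaries, weighting each by its count. The fixed-width cells used to form microclusters, the 1-D data, and the deterministic initialisation are assumptions chosen to keep the example short.

```python
from collections import defaultdict

def build_microclusters(points, cell_width):
    """Group 1-D points into fixed-width cells; return (centroid, count) pairs."""
    cells = defaultdict(list)
    for p in points:
        cells[int(p // cell_width)].append(p)
    return [(sum(v) / len(v), len(v)) for _, v in sorted(cells.items())]

def weighted_kmeans(micro, k, iterations=20):
    """Lloyd-style k-means in which each microcluster counts `count` times."""
    # Deterministic start: spread the initial centers over the sorted centroids.
    if k == 1:
        centers = [micro[0][0]]
    else:
        centers = [micro[i * (len(micro) - 1) // (k - 1)][0] for i in range(k)]
    for _ in range(iterations):
        sums = [0.0] * k
        weights = [0] * k
        for centroid, count in micro:
            j = min(range(k), key=lambda c: abs(centroid - centers[c]))
            sums[j] += centroid * count
            weights[j] += count
        # Keep the old center if no microcluster was assigned to it.
        centers = [sums[j] / weights[j] if weights[j] else centers[j]
                   for j in range(k)]
    return centers

points = [1.0, 1.2, 1.4, 2.1, 2.3, 9.0, 9.5, 10.2, 10.4]
micro = build_microclusters(points, cell_width=1.0)
centers = weighted_kmeans(micro, k=2)
print(micro)
print(centers)
```

Only the microcluster summaries are touched inside the k-means loop, so its cost depends on the number of microclusters rather than on the number of raw points.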

     

  5. Distinguish classification from decision tree.

    Classification: It is the process of finding a model (function) that describes and distinguishes data classes or concepts, for the purpose of being able to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.

    Decision tree: A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules.

 

  1. Define Data mining.

    Data mining refers to extracting or “mining” knowledge from large amounts of data.

     

  2. Define data warehouse.

    A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

    Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

     

  3. Mention the uses of data warehousing during data mining

    A data warehouse facilitates decision making.

    Its data provide information from a historical perspective.

    A data warehouse is modeled by a multidimensional database structure, which yields a data cube that allows the precomputation of, and fast access to, summarized data.

     

  4. What is meant by Apex Cuboid?

    The cuboid at the highest level of abstraction is the apex cuboid.

    The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.

    The apex cuboid, or 0-D cuboid refers to the case where the group-by is empty.

    The apex cuboid is typically denoted by all.
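To make the empty group-by concrete, the following sketch computes every cuboid of a tiny three-dimensional cube; the cuboid keyed by the empty tuple is the apex cuboid. The `sales` rows, the dimension names, and the `units` measure are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: three dimensions and one numeric measure.
sales = [
    {"month": "Jan", "city": "Pune",  "item": "TV",    "units": 3},
    {"month": "Jan", "city": "Delhi", "item": "TV",    "units": 2},
    {"month": "Feb", "city": "Pune",  "item": "Phone", "units": 5},
    {"month": "Feb", "city": "Delhi", "item": "TV",    "units": 1},
]
dimensions = ("month", "city", "item")

def cuboid(rows, group_by, measure="units"):
    """Aggregate `measure` for one group-by (one cuboid)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in group_by)] += row[measure]
    return dict(totals)

# One cuboid per subset of the dimensions: 2^3 = 8 cuboids in total.
cube = {
    dims: cuboid(sales, dims)
    for n in range(len(dimensions) + 1)
    for dims in combinations(dimensions, n)
}

apex = cube[()]   # empty group-by: a single cell holding the grand total
print(apex)
print(cube[("month",)])
```

The 3-D cuboid keyed by all three dimensions is the base cuboid, the least summarized level, while the apex cuboid collapses everything into one value.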

     

  5. State different types of concept hierarchies

    A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.

    Schema hierarchy – A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.

    Set-grouping hierarchy – Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouped hierarchy.
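A small sketch may help contrast the two kinds of hierarchy. The attribute order in `SCHEMA_HIERARCHY` and the price ranges in `PRICE_GROUPS` are hypothetical values chosen for illustration only.

```python
# Schema hierarchy: an order among attributes of the schema.
SCHEMA_HIERARCHY = ("street", "city", "province_or_state", "country")

def roll_up_attribute(attribute):
    """Return the next more general attribute, or 'all' at the top."""
    i = SCHEMA_HIERARCHY.index(attribute)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else "all"

# Set-grouping hierarchy: values of one attribute grouped into ranges.
PRICE_GROUPS = [
    (0, 100, "inexpensive"),
    (100, 500, "moderately_priced"),
    (500, float("inf"), "expensive"),
]

def price_group(price):
    """Map a numeric price to its higher-level group."""
    for low, high, name in PRICE_GROUPS:
        if low <= price < high:
            return name
    raise ValueError(f"price out of range: {price}")

print(roll_up_attribute("city"))
print(price_group(250))
```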

  6. Define virtual warehouse.

    A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

     

  7. Define data cube with example.

    A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

    Dimensions are the entities with respect to which an organization wants to keep records.

    Facts are numerical measures.

    Example: Page No. 113

     

  8. What is meant by data integration?

    Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes or flat files.

     

  9. State the difficulties of hierarchical clustering

    The quality of a pure hierarchical clustering method suffers from its inability to perform adjustments once a merge or split decision has been executed.

    That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it.

     

  10. What are the requirements of clustering in data mining?
    1. Scalability
    2. Ability to deal with different types of attributes
    3. Discovery of clusters with arbitrary shape
    4. Minimal requirements for domain knowledge to determine input parameters
    5. Ability to deal with noisy data
    6. Incremental clustering and insensitivity to the order of input records
    7. High dimensionality
    8. Constraint-based clustering
    9. Interpretability and usability

     

  11. What is meant by Five-number summary?

    Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by also providing the lowest and highest data values. This is known as the five-number summary, and it is written in the order Minimum, Q1, Median, Q3, Maximum. The interquartile range is IQR = Q3 - Q1.
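A short sketch of computing the summary with Python's standard library follows. Quartile conventions differ between textbooks and tools; the choice of `method="inclusive"` here is an assumption, and other conventions can give slightly different Q1 and Q3 values.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median(data), q3, max(data)

summary = five_number_summary([5, 1, 4, 2, 3])
iqr = summary[3] - summary[1]
print(summary, iqr)
```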

     

  12. State the use of Meta rule.

    Meta rules allow users to specify the syntactic form of rules that they are interested in mining.

    The rule forms can be used as constraints to help improve the efficiency of the mining process.

    Meta rules may be based on the analyst’s experience, expectations or intuition regarding the data or may be automatically generated based on the database schema.

  13. What is backpropagation?

    Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons.

     

  14. How are interestingness measures and thresholds specified in DMQL?

    Interestingness measures and thresholds can be specified by the user with the following statement:

    Syntax : with {(interest_measure_name)} threshold = {threshold_value}

    Example:

    with support threshold = 5%

    with confidence threshold = 70%

    The interestingness measures and threshold values can be set and modified interactively.

     

  15. Define Iceberg Query.

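    An iceberg query computes an aggregate over a group-by and returns only those groups whose aggregate value is above a given threshold. An iceberg cube is a data cube that stores only those cube cells whose aggregate value (e.g., count) is above some minimum support threshold.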

    compute cube sales_iceberg as

        select month, city, customer_group, count(*)

        from salesInfo

        cube by month, city, customer_group

        having count(*) >= min_sup
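The effect of the `having count(*) >= min_sup` clause can be mimicked with a naive sketch that counts every group-by and keeps only the cells meeting the threshold. The rows and the threshold are hypothetical, and the full cube is enumerated first purely for clarity; practical iceberg-cube algorithms avoid doing that.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, dimensions, min_sup):
    """Keep only the cube cells whose count is at least `min_sup`."""
    result = {}
    for n in range(len(dimensions) + 1):
        for dims in combinations(dimensions, n):
            counts = Counter(tuple(row[d] for d in dims) for row in rows)
            kept = {cell: c for cell, c in counts.items() if c >= min_sup}
            if kept:
                result[dims] = kept
    return result

# Hypothetical contents of salesInfo.
sales_info = [
    {"month": "Jan", "city": "Pune",  "customer_group": "retail"},
    {"month": "Jan", "city": "Pune",  "customer_group": "retail"},
    {"month": "Jan", "city": "Delhi", "customer_group": "business"},
    {"month": "Feb", "city": "Pune",  "customer_group": "retail"},
]
dims = ("month", "city", "customer_group")
iceberg = iceberg_cube(sales_info, dims, min_sup=2)
print(iceberg[dims])
```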

     

  16. What are Bayesian Classifiers?

    Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

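    This is based on Bayes’ theorem:

    P(H|X) = P(X|H) P(H) / P(X)

    where H is the hypothesis that tuple X belongs to a particular class, P(H) is the prior probability of H, P(X|H) is the probability of observing X given H, and P(X) is the prior probability of X.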
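As a worked illustration of Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X), the sketch below computes a posterior for a two-class case. The probabilities are made-up numbers, and P(X) is obtained from the law of total probability.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical values: H = "tuple belongs to the class of interest".
p_h = 0.3                  # P(H)
p_x_given_h = 0.8          # P(X|H)
p_x_given_not_h = 0.2      # P(X|not H)

# Total probability: P(X) = P(X|H) P(H) + P(X|not H) P(not H).
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(round(p_h_given_x, 4))
```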

     

  17. State any two commercial data mining systems.
    1. From database system and graphics system vendors
      1. IBM’s Intelligent Miner
      2. Microsoft SQL Server 2005
      3. MineSet from Purple Insight
      4. Oracle Data Mining (ODM)
    2. From Vendors of statistical analysis or data mining software
      1. Clementine from SPSS
      2. Enterprise Miner from SAS Institute
      3. Insightful Miner from Insightful
    3. Originating from the machine learning community
      1. CART from Salford Systems
      2. See5 and C5.0 from RuleQuest
      3. Weka developed by University of Waikato

     

  18. Mention various constraints used for association mining

    Constraints fall into five categories:

    1. Antimonotonic
    2. Monotonic
    3. Succinct
    4. Convertible
    5. Inconvertible

 

  1. What is polysemy problem?

    The main problem of lexical semantics, or word meaning, is that the meanings of individual lexemes are highly diverse. This is called the problem of polysemy.

    In simple words, a single form can have two or more related meanings.

     

  2. What is meant by sequential pattern mining?

    Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns.

    SPAM (Sequential PAttern Mining) is one algorithm for finding all frequent sequences within a transactional database.

    Example: Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.
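The notion of a frequently occurring ordered subsequence can be sketched by computing the support of a pattern over a set of customer sequences. This simplified illustration treats each event as a single item and ignores time windows such as "within a month"; the purchase histories are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    it = iter(sequence)
    # `item in it` advances the iterator, so later items must come later.
    return all(item in it for item in pattern)

def support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list per customer.
histories = [
    ["camera", "memory card", "printer"],
    ["printer", "camera"],
    ["camera", "printer"],
    ["phone"],
]

print(support(histories, ["camera", "printer"]))
print(support(histories, ["printer", "camera"]))
```

A pattern is then called frequent when its support reaches a user-specified minimum support threshold.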

     

  3. What is a distance based outlier?

    The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods.

    In this approach, one examines the local neighborhood of each example, typically defined by its k nearest examples (also known as neighbors). An example that lies far from its neighbors is a candidate outlier.
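One simple way to turn the neighborhood idea into a score is to use each point's distance to its k-th nearest neighbor. This is only a sketch of one possible variant, assuming 1-D data and absolute difference as the distance; the sample points are hypothetical.

```python
def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

points = [1.0, 1.1, 1.2, 1.3, 8.0]
scores = knn_distance_scores(points, k=2)
outlier_index = scores.index(max(scores))
print(points[outlier_index])
```

Points in dense regions get small scores, while an isolated point gets a large one, so ranking by this score surfaces outlier candidates without assuming any statistical distribution.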

     

  4. What is meant by similarity search?

    Unlike normal database queries, which find data that match the given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence.

    There are two types of similarity search: (a) subsequence matching (b) whole sequence matching
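Subsequence matching can be sketched as a sliding-window scan that reports the window closest to the query. Euclidean distance is an assumption here, as are the sample series and query; real systems typically use indexing or dimensionality reduction rather than a full scan.

```python
from math import dist

def best_subsequence_match(series, query):
    """Return (offset, distance) of the window of `series` closest to `query`."""
    m = len(query)
    best = min(
        range(len(series) - m + 1),
        key=lambda i: dist(series[i:i + m], query),
    )
    return best, dist(series[best:best + m], query)

series = [0, 1, 2, 3, 10, 11, 12, 3, 2]
query = [10, 11, 13]
match = best_subsequence_match(series, query)
print(match)
```

Whole sequence matching is the special case in which the query and the stored sequences have the same length, so only one window per sequence is compared.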

     

  5. What is meant by intelligent query answering?

    Intelligent query answering analyzes the user’s intent and answers queries in an intelligent way.

    Intelligent query answering consists of analyzing the intent of the query and providing generalized, neighborhood, or associated information relevant to the query.

     

  6. State the feature of MineSet

    A distinguishing feature of MineSet is its set of robust graphics tools, including rule visualizer, tree visualizer, map visualizer, and scatter visualizer (multidimensional data) for the visualization of data and data mining results.

     

  7. What are the limitations of COBWEB?
    1. It is based on the assumption that probability distributions on separate attributes are statistically independent of one another. But this assumption is not always true.
    2. The probability distribution representation of clusters makes it quite expensive to update and store the clusters.
    3. The classification tree is not height-balanced for skewed input data, which may cause the time and space complexity to degrade dramatically.

     

  8. Distinguish classification tree from decision tree.

    Classification tree: Each node in a classification tree refers to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node. (Fig. – Page No. 432) The probabilistic description includes the probability of the concept and conditional probabilities of the form P(Ai = vij | Ck), where Ai = vij is an attribute-value pair and Ck is the concept class.

    Decision tree: A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. The nodes are logical descriptors rather than probabilistic descriptors. Decision trees can easily be converted to classification rules.

 

  1. Mention any two significances of web mining.

    Web mining is the application of data mining techniques to discover patterns from the Web.

    Web mining typically addresses semistructured or unstructured data, such as web pages and log files with mixed content involving multimedia, flow data, etc., often represented by imprecise or incomplete information.

    Web mining is widely used to search content on the World Wide Web.

 
